Llama.cpp: Port of Facebook's LLaMA model in C/C++, with Apple Silicon support (github.com/ggerganov)
989 points by mrtksn on March 10, 2023 | 284 comments



A quick survey of the thread seems to indicate the 7b parameter LLaMA model does about 20 tokens per second (~4 words per second) on a base model M1 Pro, by taking advantage of Apple Silicon’s Neural Engine.

Note that the latest model iPhones ship with a Neural Engine of similar performance to latest model M-series MacBooks (both iPhone 14 Pro and M1 MacBook Pro claim 15.8 teraflops on their respective neural engines; it might be the same exact component in each chip). All iPhone 14 models sport 6GB integrated RAM; the MacBook starts at 8GB. All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.

Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.

So. With Whisper and LLaMA on the Neural Engine both showing better than real-time performance, and Apple’s own pre-existing Siri Neural TTS, it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone. This is absolutely extraordinary stuff!


> All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.

Battery capacity and thermals are different and might be problematic. The phone might throttle performance earlier.

> it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone.

As a demo, yes, but would loading the model be fast enough for Siri-like responsiveness? You also would want to run other programs alongside it.

And of course, for Apple to adopt something like this, we would have to get rid of the tendency of these models to derail conversations. Put in something somewhat sexist/racist/…, and it will reply with something a bit more sexist/racist/…

But yes, it would be a cool demo.


> we would have to get rid of the tendency of these models [...] reply with something a bit more sexist/racist/

If you don't want it to be racist, don't say racist things to it. Also, it'll be fairly clear where the racism came from - like a parrot and their owner.

AIs that can tweet, like MS Tay, and that remote-work chatbot, get a lot of attention when they melt down. Private AIs on your phone don't seem like they'll cause any concern with the phone-using public.

I think we'll appreciate the benefits more than we'll mind that others can make it say dirty words.


Siri doesn’t seem as fast or responsive compared to Google assistant at times.


At this point in time, Siri as a voice-driven assistant has become so totally and utterly useless, it's not even worth comparing it to anything else. I wonder how a company can work on a feature like that for 10 years, and manage to make it worse with every release they put out.

At this point in time, Apple should be so embarrassed of Siri that I really think scratching the whole thing would have a net benefit.

Scratch it, and start over. And fire everyone involved with Siri :-)


The logistics aren't that easy, Apple's entire product line runs Siri.


Siri is sometimes busy doing laundry or Gods know what. I think the quality of Siri is much better than Google Assistant but I wonder about the lag.


Really? I find Siri can’t understand anything slightly more than basic instructions.

Google assistant can seem to do more


I'm very interested in this space. Can you share an example that illustrates the difference in "understanding" between the two?


Just recently Siri would go belly-up on “Turn off Living Room lightS” — it would only work if I said “light” (singular). Extremely frustrating. They fixed it, I think, but this arbitrariness and many other quirks make me think Siri is more quirk- and algorithm-based than a true AI.


Handling smart home requests is the one thing that Siri seems to do more or less without error, at least for me. I use that multiple times per day, and cannot remember the last time that it did not work.


Is Siri better, or does it have you well trained? My smart home stuff works best for me because I know more of the exact labels. I was literally surprised the other day that my wife included an S and it still worked.


Mine is really really poor at it.

Half the time it responds with "one moment.. One moment.. this is taking too long" or "I have problems connecting to the internet". But there are no internet problems whatsoever, and it connects to my Home Assistant using the local HomeKit integration, which shouldn't even need that.


> Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.

He has already done great work here: https://github.com/ggerganov/whisper.cpp


If I may, this library runs LLaMA on CPU. There is no way to run it on the Neural Engine yet.

The optimization in this case only seems to refer to the 4bit model loading method (to be friendlier to the arm64 CPU)

GeoHot has tinygrad running LLaMA on Metal (but only the 7B model); that's the closest I've seen to taking advantage of Apple Silicon.

Neural Engine implementation would be awesome


Oh shit, I took a closer look and you’re right. The repo was also helpfully updated with a note to this effect: “The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder, there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't know how to utilize it properly. But in any case, you can even disable it with LLAMA_NO_ACCELERATE=1 make and the performance will be the same, since no BLAS calls are invoked by the current implementation”.

No Joi in my pocket just yet :(

Because of this I re-checked my claims about the Whisper speed up from the Neural Engine and that does look legit, 6x at least. So the Neural Engine does have the chops for this workload, it just isn’t being used in this repo. It may not be LLaMA, but I sure hope someone gets an LLM running on the ANE sooner rather than later.


Our investigations indicate that it might not be possible to achieve ANE performance improvement over CPU for LLM Decoder inference with batch size of 1 [0]. Just to make it clear - I'm no expert in Core ML / ANE, so these conclusions could be totally wrong.

[0] https://github.com/ggerganov/whisper.cpp/discussions/548#dis...


Don’t sell yourself short! (And you have my apologies in advance if my excited comment above has created any extra work for you)


Neural Engine across the M1 and M2 series is also sadly very limited.

I bought one thinking I could exploit it for Stable Diffusion and other tasks, but found that most libraries say to use the GPU for faster generation. What I found is that not only is the engine the same on the M2 Pro (meaning I upgraded for no reason from my M1 base model), but it also doesn't scale at all, except in the M1 Ultra where it's doubled simply because it's using two dies bridged.

The Neural Engine can generate 512x512 images pretty easily but takes a while, even compared to using the GPU on a base-model M1 Mac Mini. It's kinda crazy. I'm looking into ways to improve it and take advantage of the Neural Engine in the future, but the current situation is very limited. Even Apple's official implementation and Core ML libraries seem to prefer you run them on Metal.


The 7b model specifically is not quite "ChatGPT-level" though, is it?


According to Meta's benchmarking[0] it is comparable on many metrics. I haven't used it myself so I can't say for sure if that is the case when actually using it.

[0]: https://arxiv.org/pdf/2302.13971.pdf


That's GPT3, not ChatGPT.


I don't understand this topic well, but given the premise that GPT-3 and ChatGPT differ only in that ChatGPT includes RLHF (Reinforcement Learning from Human Feedback), and LLaMA 7B is comparable to GPT-3 on a number of metrics, it would follow that if we were to improve LLaMA 7B with RLHF, the 7B model would be similar to ChatGPT. Is that correct?


You're likely right that applying RLHF (+ fine-tuning with instructions) to LLaMA 7b would produce results similar to ChatGPT, but I think you're implying that that would be feasible today.

RLHF requires a large amount of human feedback data and IIRC there's no open data set for that right now.


There's open-assistant.io, which is doing RLHF directly in the open


And they've already collected over 100,000 samples, iirc ChatGPT was trained on something like 30,000 samples, so the open models should already be positioned to succeed.


There are open datasets (see the chatllama harness project and its references). You can of course also cross train it using actual ChatGPT.


Is there something I'm missing? ChatLlama doesn't reference any human feedback datasets.

> You can of course also cross train it using actual ChatGPT.

You mean train it on ChatGPT's output? That's against OpenAI's terms of service.


> You mean train it on ChatGPT's output? That's against OpenAI's terms of service.

Oh no, someone call the internet police.

I'm sure scraping tons and tons of images and web data to train DALLE and GPT and then selling access to that data to others was also against many licenses and terms of services, but OpenAI did those anyway.


None of these AIs were created ethically. At the very least we can make sure these huge models don’t solely belong to monopolistic tech companies and democratize their power.


You’re missing something. Both SHP (https://huggingface.co/datasets/stanfordnlp/SHP) and OpenAssistant datasets are referenced.

And while the TOS violation might be the case, the project nevertheless has a mode to use OpenAI in the fine-tuning steps.


I’m interested in this as well. Comparatively little attention has been paid to those 7B model results, but they look quite good against 175B GPT-3.

As for ChatGPT, that is GPT-3.5 (same 175B model, but with instruction fine-tuning), plus the RLHF.


GPT 3.5 likely differs from the original GPT 3 by more than instruction fine-tuning. For example, it was probably retrained under Chinchilla scaling laws [1], with a lot more data and maybe a somewhat smaller parameter count.

There are many variants of GPT-3 and GPT-3.5, and based on the performance numbers in Meta’s paper, it looks like they’re comparing against the very first version of GPT-3 from 2020. [2]

[1] https://arxiv.org/abs/2203.15556

[2] https://arxiv.org/abs/2005.14165


There's no overhead introduced for the 'final' model inference, is there?


None of the Meta models are RLHF tuned, as far as I know.


I wish we could start having open source TTS models with similar performance. So far Tortoise TTS is not there yet. I'm not sure if Siri neural TTS is offered for 3rd party apps.


>20 tokens per second (~4 words per second)

How can there be 5 tokens per word, when LLaMA has more than half the vocabulary of GPT-2/3, which has about 1.3 tokens per word?

I would have guessed more like 1.5 tokens per word.


Oh, it’s probably higher than four words per second, then. I assumed tokens was characters and used the standard “there are five characters in a word” rule of thumb.


It's about 4 characters per token, so just over 1 token per word. I just round to 1 token per word since the text most people generate does not use larger words, and because larger common words are still encoded as one token (e.g. HackerNews is probably one token despite being 10 characters).
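
For reference, the conversion people are doing here is just this (my napkin math, using the roughly 1.3 tokens per word figure mentioned upthread):

    # Napkin math: converting a token rate into a word rate.
    tokens_per_second = 20    # the M1 Pro figure quoted at the top of the thread
    tokens_per_word = 1.3     # rough figure for English with GPT-style BPE vocabularies
    print(tokens_per_second / tokens_per_word)   # ~15 words per second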


I typically see people claim 2-3 tokens per word.


But in real life, won't it be that no one would want to run a voice command which consumes a lot of CPU and battery, as opposed to making a network call to a service which has this model hosted?

Agreed that this can always be improved and hardware can get more efficient and better too, but at the end of the day, would it ever be better than an API call?


I live in eastern Oregon on a property with no cell service.

I use Siri a lot, mainly to add reminders, and sometimes I try to use Siri when I'm out at the greenhouse, which is just past the edge of the mesh network. I would love for those reminders to get added - even if it burnt battery.

And more generally I would love for people writing apps to consider that phones don't always have service - as would my neighbors.


Privacy concerns are justified.

It's not just that, this can also work completely offline.


I'm looking forward to running stuff like this online. Using big-tech corporate soulless SaaS AI is just pure dystopia material.

It's even better that we are talking about a relatively low power machine here. Maybe can operate offered.


You mean offline?


Ultimately, no amount of technology will ever beat the speed of light. Running locally will always have a lower latency floor.
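
For a sense of scale (my own napkin numbers; real round trips are several times worse once routing, TLS, and server-side queueing are added):

    # Lower bound on network round-trip time imposed by the speed of light in fiber.
    distance_km = 1000                  # assume a data center ~1000 km away
    light_speed_fiber_km_s = 200_000    # light travels at roughly 2/3 c in optical fiber
    rtt_ms = 2 * distance_km / light_speed_fiber_km_s * 1000
    print(f"{rtt_ms:.0f} ms minimum round trip")   # ~10 ms before any compute happens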


Theoretically yes. But in the real world, no.

Simple thought experiment: you want to know how many tons of copper are mined in the US each year. Lowest possible latency is calculating this in your head, most likely using data you don’t have. Looking it up online is a lot, lot faster.

In some far future world maybe every transistor will include the sum total of human knowledge up to the nanosecond, but that’s a pretty far future. There are many things where running locally means a higher latency floor.


It's still cheaper to run a free model on a competitive "dumb" cloud host than to buy a service only one company provides.


There are still a few people in the world who don't have always-on gigabit internet access everywhere they go.


"There is No Reason for Any Individual To Have a Computer in Their Home"


I would also expect 10x improvements over the next year due to optimizations found throughout the stack.


Consider the potential drawbacks of relying entirely on voice-operated assistants like ChatGPT. There are concerns around privacy and the use of personal data, as well as the potential for bias and inaccuracies in the responses generated by these models. It's important to strike a balance between the convenience and benefits of these technologies and the potential risks and limitations they bring. Nonetheless, the advancements being made in this field are impressive and it will be interesting to see how they develop in the future.


That's very ChatGPT of you to say!


I think voice assistants can perform actions on phones (e.g. "open app, message Alice, call Bob, turn off Bluetooth"). This couldn't do that (I think), which is an obvious drawback.


4 words a second doesn't seem fast enough for a voice assistant?


It's faster than that [0]: 20 tokens/s should be approximately 15 words per second.

0: https://help.openai.com/en/articles/4936856-what-are-tokens-...


I've had difficulty obtaining useful results from the smaller (7B-sized) models. The issue lies in the content, not the speed. If you could stream the text-to-speech, the speed alone would be satisfactory.


You're right I overestimated how fast we talk!


Some rules of thumb I use for estimating this kind of stuff

100wpm: Max typing speed

200wpm: Max speaking speed

300wpm: Max listening speed, max reading speed with subvocalisation

900wpm: Max reading speed without subvocalisation


Doing napkin math, this model should be hitting 900wpm


If you are interested in implementing LLaMA yourself or learning, I noticed that the reference code by Facebook is one of the cleaner, easier to read ML code I've seen in a while. https://github.com/facebookresearch/llama/blob/main/llama/mo... It's about 200 lines long. You probably do need a bit of knowledge to understand what you are reading but I was pleasantly surprised.

For example in comparison, StableDiffusion torch code in diffusers and transformers Python libraries has lots of conditionals, experiments etc. that are not being used that can make it hard to follow what is going on.

Last weekend I got the "main loop" of the transformer working in pure CPU Rust code, following the reference code. My crappy code is just very very slow as I focused on getting it to run, not making it fast. The tokenizer uses some Google thing https://github.com/google/sentencepiece but luckily for inference it seems that you just need to be able to parse the tokenizer model file and not understand how it was created; I was able to strip out the protobuf files from that repository, add them to Rust, and read the tokens.

I am optimistic that someone makes a high quality CPU or some CPU+GPU+SSD combination thingamajig that will make it somewhat practical to run even the large LLM models without needing an A100 or two.
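
For anyone poking at the tokenizer from Python rather than Rust: the sentencepiece package can load the tokenizer.model file that ships with the weights directly, which is handy as a reference to check your own parser against (the path here is just a placeholder for wherever your copy lives):

    import sentencepiece as spm

    # Load the tokenizer model that ships alongside the LLaMA weights.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

    text = "Hello, world!"
    ids = sp.encode(text)                    # token ids fed to the model
    pieces = sp.encode(text, out_type=str)   # the subword pieces themselves
    print(ids)
    print(pieces)
    print(sp.decode(ids))                    # round-trips back to the original text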


My code for this is very much not high quality, but I have a CPU + GPU + SSD combination: https://github.com/gmorenz/llama/tree/ssd

Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...

At least with my hardware this runs at "[size of model]/[speed of SSD reads]" tokens per second, which (up to some possible further memory reduction so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk each token.

At 125GB and a 2GB/s read (largest model, what I get from my SSD) that's ~60 seconds per token (1 day per 1440 words), which isn't exactly practical. Which is really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.

You could probably optimize quite a bit for batch throughput if you're ok with the latency though.
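
Rough numbers for the batching idea, assuming weight streaming stays the only bottleneck (my own napkin math): one pass over the weights serves every sequence in the batch, so total throughput scales with batch size while per-token latency stays at roughly one full read of the model.

    # One pass over the weights serves every sequence in the batch.
    model_bytes = 125e9        # 65B model at ~4 bits, as above
    ssd_bytes_per_s = 2e9      # ~2 GB/s sequential read
    seconds_per_pass = model_bytes / ssd_bytes_per_s   # ~62 s per token, regardless of batch size

    for batch in (1, 16, 64):
        print(f"batch {batch:3d}: ~{batch / seconds_per_pass:.2f} tokens/s total")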


Yeah, it does seem like there's a fundamental limit how fast you can go even if you engineer the data juggling to perfection. My guess is that every loop through the transformer is going to have to visit every weight and if those weights cannot fit in your fastest memory, then it's going to have to spend time transferring data from SSD or whatever is lower in your memory hierarchy.

The quantization used in the post luckily seems to work somewhat well; I'm also wondering if some new clever ways will be invented that reduce the amount of data you need to juggle. Maybe e.g. not just using 4-bit weights but also compressing them in some way, sorting the weights or something.


Huffman encoding the weights (treating each 16-bit float as a symbol) could reduce the weights size to ~85% of the original (I calculated this exactly before, but am going from memory). You could maybe get a bit more than that with arithmetic encoding (if you managed to decode fast enough), but it shouldn't be that much more.

Once you start including lossy steps like quantization though it's much less clear. At some point you just reach "knowledge distillation is an open problem".
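
Here is roughly how you would estimate that ceiling yourself: treat each fp16 value as a symbol, compute the empirical entropy, and compare it with the 16 bits currently spent per weight. The tensor below is a synthetic stand-in, so the exact ratio will differ from real LLaMA weights:

    import numpy as np

    # Stand-in for a real weight tensor; real checkpoints also cluster near zero,
    # which is what makes entropy coding pay off at all.
    weights = np.random.normal(0, 0.02, size=1_000_000).astype(np.float16)

    _, counts = np.unique(weights, return_counts=True)
    probs = counts / counts.sum()
    entropy_bits = -(probs * np.log2(probs)).sum()

    print(f"~{entropy_bits:.1f} bits/weight vs 16 stored")
    print(f"best-case lossless ratio ~{entropy_bits / 16:.0%}")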


Perhaps there is an instance of Amdahl's law lurking in the midst?


Won't the 65b model (almost) fit into 128GB RAM? Or into 128GB RAM and 24GB VRAM?


LLaMA-65B fits in 32GB of VRAM using state of the art GPTQ quantization with no output performance loss.

https://github.com/qwopqwop200/GPTQ-for-LLaMa


So if I'm reading this right, 65B at 4bit would consume around 20GB of VRAM and ~130GB of system RAM?


LLaMA doesn't require any system RAM to run.

It requires some very minimal system RAM to load the model into VRAM and to compile the 4bit quantized weights.

But if you use pre-quantized weights (get them from HuggingFace or a friend) then all you really need is ~32GB of VRAM and maybe around 2GB of system RAM for 65B. (It's 30B which needs 20GB of VRAM.)


The full use case includes quantisation, which the repo points out uses a large amount of system RAM. Of course that’s not required if you skip that step.


Judging from downloads of the 4bit file and how many people I've seen post about quantizing it themselves, around 99% of people are just downloading the pre-quantized files.

I would not personally call compilation of software part of its "use case." Its use case is text generation.


Quantisation is a once-off process. I suspect most people who don't have access to a machine with enough RAM and don't want to use the pre-quantized version can afford the $20 to hire a big cloud server for a day.

Or it is probably possible to make it work slowly using a swapfile on Linux.


Closer to 38-40GB VRAM (and hardly any RAM).


Yes (I just don't have that much ram)

I have a separate branch that streams weights from ram - at which point I think I was only seeing negligible performance loss compared to storing the weights in vram. The bottleneck was compute, not GPU bandwidth.


The 65B model only needs just over 32GB of VRAM to run. It does not need system RAM to run/use if you use pre-quantized weights which you can find many places already.

No need to quantize yourself (besides it takes almost a day to do 4bit GPTQ quantization on 3xA6000).


Quantizing is a lossy process, you can't really claim to be running the 65B model llama at that point (though the 65b qgpt-llama does look like it might be very useful)


The GPTQ paper https://arxiv.org/abs/2210.17323 claims "negligible accuracy degradation relative to the uncompressed baseline".


Are you sure? I think it took a mere 2 hours to do 4bit GPTQ quantization of LLaMA-65B on 1x RTX 3090. But i may be mistaken.


I noticed the fastText code was also surprisingly clean and readable C++. Whatever moral and other flaws Meta's business model might have in general, they seem to have a consistently excellent track record when it comes to publicly available libraries and tools.


Very nice post, good lead. It makes me curious... I wonder what LLaMA would look like implemented upon the newly released OpenXLA[1]! Is that even a sensible ask? I feel like it could potentially be an informative exercise that would aid in the understanding of the landscape of tooling.

[1] https://opensource.googleblog.com/2023/03/openxla-is-ready-t... https://news.ycombinator.com/item?id=35078410


We have it running as part of SHARK (which is built on IREE). https://github.com/nod-ai/SHARK/tree/main/shark/examples/sha...


I’m pretty sure the code you linked is just simplified for publication. I think it’s interesting to read, I just don’t think it’s what they actually used to train and develop the algorithm.


This is only the model code, which defines the shape and how to do a forward pass.

It isn't the training code, but it would be unlikely that the model code used then is any different.


There are little hints strewn out through the code that suggests it is indeed “trimmed” from a larger codebase.


tinygrad by geohot, also linked on this thread, has similar properties good for learning - it's a couple hundred LoC to integrate LLaMA support

https://github.com/geohot/tinygrad/tree/llama


Just don’t copy their sampler. It’s trashcan-tier. Bad defaults and no repetition penalty.


I think tuning the sampler temperature and using top-k over top-p sounds ad hoc and shouldn’t be necessary for a solid model. Do you have any reason for suggesting those changes in particular? Especially since top-p, or nucleus sampling, is meant to be an improvement over top-k.


Relevant: since LLaMA leaked on torrent, it has been converted to Hugging Face weights and it has been quantized to 8-bit for lower VRAM requirements.

A few days ago it was also quantized to 4-bit, and 3-bit is coming. The quantization method they use is from the GPTQ paper ( https://arxiv.org/abs/2210.17323 ), which leads to almost no quality degradation compared to the 16-bit weights.

4 bit weights:

Model, weight size, VRAM req.

LLaMA-7B, 3.5GB, 6GB

LLaMA-13B, 6.5GB, 10GB

LLaMA-30B, 15.8GB, 20GB

LLaMA-65B, 31.2GB, 40GB

Here is a good overall guide for Linux and Windows:

https://rentry.org/llama-tard-v2#bonus-4-4bit-llama-basic-se...

I also wrote a guide how to get the bitsandbytes library working on windows:

https://github.com/oobabooga/text-generation-webui/issues/14...
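
As a sanity check on the table above: the weight sizes follow almost directly from parameter count times 4 bits (0.5 bytes per parameter), with the VRAM column adding headroom for activations and context. Rough arithmetic of my own, using the actual parameter counts from the LLaMA paper:

    # 4-bit quantization stores roughly 0.5 bytes per parameter.
    for name, params in [("7B", 6.7e9), ("13B", 13.0e9), ("30B", 32.5e9), ("65B", 65.2e9)]:
        gb = params * 0.5 / 1e9
        print(f"LLaMA-{name}: ~{gb:.1f} GB of weights, plus a few GB of VRAM for activations/context")

This lands within a gigabyte or so of the table; the small differences come from per-group quantization scales and the exact layer shapes.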


George Hotz already implemented LLaMA 7B and 13B on Twitch yesterday on GPU in the tinygrad llama branch:

https://github.com/geohot/tinygrad/tree/llama

The only problem is that it's swapping on a 16GB MacBook, so you need at least 24GB in practice.


There are two coding sessions streamed so far, very interesting to watch

George Hotz | Programming | can we fit a LLaMA inside a tinygrad? https://www.youtube.com/watch?v=0kRDs9BW2NU

George Hotz | Programming | ChatLLaMA: get in losers we're building a chatbot https://www.youtube.com/watch?v=nctqc8FBJ2U


Both are very impressive. A nice thing about Gerganov's implementation is that it is written in almost pure C. Arguably easier for deployment.


What does almost mean in this case?


There is also C++


IOW, it probably wouldn't compile with an actual C-only compiler, but by and large it looks more like C than like C++?


There is also a GPU-accelerated fork of the original repo:

https://github.com/remixer-dec/llama-mps


> For 7B model, it always goes above 32gb of RAM,

That's double of what Tinygrad uses


Tinygrad is using openCL right?


It's not tinygrad really, it is PyTorch.


Unfamiliar with the domain, but is it? Looking through the changes (https://github.com/geohot/tinygrad/compare/master...llama) it seems to not mention pytorch at all but tinygrad + numpy.


Can you link to the Twitch stream VoD?


https://www.twitch.tv/georgehotz/videos?filter=archives&sort...

although, there is a VOD channel on YT that might be better.


why does it say video unavailable?


Hmm. I'm not sure. Presumably this will be updated soon: https://www.youtube.com/@geohotarchive/videos


I'm pretty sure he only makes his past twitch streams available to subscribers. It's weird they are available on Youtube, maybe he doesn't know that's something he can change.


I don't think he runs the YouTube channel - it's managed by some of his fans. But like what's he gonna do about it? Send them a takedown notice?


Not sure about his more recent content but he used to have a policy that the VODs could be uploaded in full as long as they were uncut and not modified. If you go watch some older VODs he says so at the beginning.


This is such a refreshing and neat way to use Twitch.


that is so impressive!


I got this working on my 64GB M2 MacBook Pro!

Wrote detailed notes here for anyone else who wants to try this: https://til.simonwillison.net/llms/llama-7b-m2


I have the 64GB M2 Max; I will have to try this out. It's amazing that this laptop can do things my high-end desktop can't. Apple really made some good architectural choices with Apple Silicon.


Thank you for these helpful notes!!


Amazing. Thank you very much !


Thank you!


Super cool project. This is from the author of whisper.cpp, which enables highly accurate real-time audio transcription on the M1/M2:

https://github.com/ggerganov/whisper.cpp


Georgi just added support for all models (13B/33B/65B) [0]

LLaMA 65B can do ~2 tokens per second on my M1 Max / 64 gb ram [1]

[0] https://twitter.com/ggerganov/status/1634488664150487041 [1] https://twitter.com/lawrencecchen/status/1634507648824676353


Very cool. I've seen some people running 4-bit 65B on dual 3090s, but didn't notice a benchmark yet to compare.

It looks like this is regular 4-bit and not GPTQ 4-bit? It's possible there's quality loss but we'll have to test.

>4-bit quantization tends to come at a cost of substantial output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit) quantization methods and even when compared with uncompressed fp16 inference.

https://github.com/ggerganov/llama.cpp/issues/9


On my M1 Ultra, LLaMA 65B generates ~3 tokens per second (using 16 threads).


Could someone with experience explain: what's the theoretical minimum hardware requirement for llama 7B, 15B, etc, that still provides output on the order of <1sec/token?

It seems like we can pull some tricks, like using F16, and some kind of quantization, etc.

At the end of the day, how much overhead is left that can be reduced? What can I expect to have running on 16gb ram with a 3080 and a midrange AMD processor?


The 4-bit GPTQ LLaMA models are the current top-performers. This site has done a lot of the heavy lifting: https://github.com/qwopqwop200/GPTQ-for-LLaMa

With 30b-4bit on a RTX 4090, I'm seeing numbers like:

Output generated in 4.17 seconds (4.03 tokens/s, 21 tokens)

Output generated in 4.38 seconds (4.25 tokens/s, 23 tokens)

Output generated in 4.57 seconds (4.25 tokens/s, 24 tokens)

Output generated in 3.86 seconds (3.40 tokens/s, 17 tokens)

The lower sizes (7B, 13B) are even faster with lower memory use. A 16GB 3080 should be able to run the 13B at 4-bit just fine with reasonable (>1 token/s) latency.


With a recent PR, text-generation-webui makes this very easy to use https://github.com/oobabooga/text-generation-webui/pull/206


And on an M1?

I have 64gb available


Well I was able to run the original code with the 7B model on 16GB vram: https://news.ycombinator.com/item?id=35013604

The output I got was underwhelming, though I did not attempt any tuning.


The author just made an update that makes the generation much better, even with the 7B model:

https://twitter.com/ggerganov/status/1634310199170179075

I tried it out myself (git pull && make) and the difference in results is night and day! It's amazing to play with, although you should prompt it differently than ChatGPT (more like the GPT-3 API).


Parameter tuning is pretty necessary, according to anecdotes. People on Twitter have gotten good results by changing the default parameters.


For 13b and 30b, it really needs high temperature to produce good outputs.


16GB of VRAM can run the 7B for sure. I'm not sure what the most cutting-edge memory optimization is, but the 13B is going to be pretty tight; I'm not sure that'll fit, with what I know of at least. I've got it working at a bit over 20GB of VRAM, I think, at 8-bit.

If you can't fit it all in VRAM you can still run it, but it'll be slooooow; at least that's been my experience with the 30B.


At 4 bits the 13B LLaMa model can run on a 10GB card!


Interesting. But how about the Apple Neural Engine (ANE)? I've always wondered if the ANE is ML-worthy; maybe it's really only for inference or, who knows, even training somehow. I've seen Apple's marketers bragging about it [1], even with code examples, but AFAIK no useful libraries nor reliable measurements and community interest exist in the wild for doing ANE ML on Macs.

1. https://machinelearning.apple.com/research/neural-engine-tra...

Edit: just found this: https://github.com/ggerganov/whisper.cpp/pull/566


Absolutely love ggerganov's approach with models like this and Whisper. It's just awesome being able to experiment with (what I consider) complex models without needing a billion python/c/cpp dependencies!


The performance on Macbook with M1 Pro is said to be 20 tokens/s

https://twitter.com/ggerganov/status/1634282694208114690


A Macbook Pro M1 would have the base M1 CPU while he was referring to the M1 Pro CPU in something like a Macbook Pro w/ M1 Pro. It's confusing naming by Apple.


right, fixed it


This is faster than running it on an RTX 4090 I think.


I get 32 tokens/sec on a 4090 using GPTQ 4bit with streaming off, with the model 5x larger than that.

So nowhere close to the 4090, but plenty fast anyway.


Nope a 4090 can do the 30b-4bit model at 20 tokens/s


Insanity! This is the same guy who wrote Whisper C++! How does he do this? I feel like I am a side character in some cartoon gasping at the unthinkable power level of the main character.


What’s even more impressive is this part

> This was hacked in an evening


That's all fine and good. But to do anything useful, you're going to want a powerful GPU (RTX 3090, RTX 4090 or A6000) with as much VRAM as possible. Unlike the diffusion models, LLM's are very memory-intensive, even at 4-bit GPTQ. The larger models like llama-13b and llama-30b run quite well at 4-bit on a 24GB GPU. The llama-65b-4bit should run on a dual 3090/4090 rig.

Coupled with the leaked Bing prompt and text-generation-webui, the results are quite impressive.


Macs have UMA, so an off-the-shelf Mac can use up to about 120GB of VRAM. Far more than any consumer card, more than the biggest A100.

GPU power is lower, of course, but pure vram is not a problem.


VRAM is the thing that Apple Silicon is going to have in excess compared to anything even close in price. MacBook Airs can have 14-15GB of VRAM if necessary.


RAM read bandwidth of M1/M2 is still not really competitive with the large GPUs like RTX3090, but it's getting close, compared with the usual CPU setups.


Is there an upper limit on the usable VRAM on Airs, or is it just tied to RAM size minus a bit for the OS? Just got a 24GB Air M2 and your comment made me wonder if it was capped at n amount of VRAM, or if it's just that Air's now can have 24GB of ram compared to 16GB with the M1 Airs?


Apple Silicon uses unified memory, so laptops have up to 64GB of VRAM and the Mac Studio can have up to 128GB.

That makes them uniquely “powerful” for inference with large models.


Nowadays up to 96GB on the laptops.


Here is some quick math: these devices have SSD read speeds somewhere around 2GiB/s. With 4-bit quantization, we are looking at loading 4B parameters per second. That means we need ~8s per token for the 30B model. Hmmm, the math is a bit off (or I need to look closer at whether we can do more tokens per iteration with some batching).


You can use Speculative Sampling, where a draft model is used to generate short continuations of a sequence, which are scored in parallel by the large model. The draft model can be small, and you only need to call the large model from time to time, so you can stream it from SSD or cheap RAM.

Using LRDIMM DDR4 at the price of less than $1000 it is possible to stream GPT-3 five times a second, in 4bit quantisation. Multiply that with the 2-2.5x speedup from Speculative Sampling.

>Accelerating Large Language Model Decoding with Speculative Sampling

https://arxiv.org/abs/2302.01318
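
Very roughly, the trick looks like this (a toy sketch with made-up stand-in "models", not the paper's exact algorithm; it follows the accept-with-probability p/q rule and the max(0, p - q) resampling, but skips the bonus token you get when every draft token is accepted):

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 32

    # Toy stand-ins: each "model" just maps a context to a distribution over the
    # vocabulary. In reality the draft model is the small/fast one and the target
    # model is the big one you only want to run occasionally.
    def toy_model(context, temperature):
        logits = np.sin(np.arange(VOCAB) * (len(context) + 1)) / temperature
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def draft_model(ctx):
        return toy_model(ctx, temperature=1.5)

    def target_model(ctx):
        return toy_model(ctx, temperature=1.0)

    def speculative_step(context, k=4):
        # 1) The draft model proposes k tokens autoregressively (cheap).
        ctx, draft_tokens, draft_probs = list(context), [], []
        for _ in range(k):
            q = draft_model(ctx)
            t = int(rng.choice(VOCAB, p=q))
            draft_tokens.append(t)
            draft_probs.append(q[t])
            ctx.append(t)

        # 2) The target model scores those positions (in reality: one batched pass).
        ctx, accepted = list(context), []
        for t, q_t in zip(draft_tokens, draft_probs):
            p = target_model(ctx)
            if rng.random() < min(1.0, p[t] / q_t):
                accepted.append(t)          # draft token accepted "for free"
                ctx.append(t)
            else:
                # 3) On rejection, resample from the corrected distribution max(0, p - q).
                residual = np.maximum(p - draft_model(ctx), 0)
                residual = residual / residual.sum() if residual.sum() > 0 else p
                accepted.append(int(rng.choice(VOCAB, p=residual)))
                break
        return accepted

    print(speculative_step(context=[1, 2, 3]))

The point for the memory-bound case is that the expensive model only has to stream its weights once per k-token verification pass instead of once per token.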


I think your math is missing some details.

An RTX 4090 has a memory bandwidth of 1,008 GB/s.

PCIe 4 x16 has a 32GB/s bandwidth.

DDR4 RAM has 3200 MT/s transfer rate.

An AMD 5900x can easily max that out.

A good NVME can hit 7 GB/s read speeds or better.

And the 4-bit CUDA kernels can pack 16x 4-bit ints into a single 64-bit transfer / register.
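
To turn those bandwidth numbers into a rough ceiling on decoding speed (my own napkin math; assumes a memory-bandwidth-bound decoder at batch size 1 that reads essentially all the weights per token, using a ~16 GB 4-bit 30B model as the example):

    model_gb = 16.25   # ~30B parameters at 4 bits

    tiers = {
        "RTX 4090 VRAM": 1008,           # GB/s
        "DDR4-3200, dual channel": 51,   # 3200 MT/s * 8 bytes * 2 channels
        "PCIe 4.0 x16": 32,
        "fast NVMe SSD": 7,
    }
    for name, gbps in tiers.items():
        print(f"{name:26s} ceiling ~{gbps / model_gb:5.1f} tokens/s")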


I wrote about why I think LLaMA + llama.cpp means that large language models are having a "Stable Diffusion moment" right now: https://simonwillison.net/2023/Mar/11/llama/


I'm a huge fan of Georgi (the author)! You should also check out his other work, bringing Apple Silicon support to OpenAI's Whisper (speech-to-text model): https://github.com/ggerganov/whisper.cpp


I'm running 4-bit quantized llamas on torch/cuda with https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing significant tokens/second perf degradation compared to 8-bit bitsandbytes mode. I'm very new to this, and understand very little detail, but I thought it would be faster?


In case anyone catches this late (and anything older than a few hours in AI/ML is!), some of the original llama HF weights were not built correctly and gave poor output.

Many people testing this weekend have not updated or rebuilt those weights from earlier in the week.


Eh, I’d expect it to be slower if anything. Think about it like this. If you write an image renderer, bitmap would be the fastest, because it’s already decompressed. 4-bit quantization is a compression algorithm.

It depends on the details of memory bandwidth vs compute though.


I have very limited knowledge in this domain.

Why is it necessary to port LLaMA to C? Assuming the original model implementation was in Python, did it not require just a few tweaks to make it work on Apple Silicon?


Yes this is a good question. Why did they focus on a specific model rather than a generic solution that makes ANY python based model work on Apple silicon?


Because that's not how machine learning models work. Machine learning as a field goes through a nearly complete revolution annually. Every new major model is a special snowflake of unique cases.

Writing high performance software that handles all of them is next to impossible, because it's the special tailoring to the unique features of a given model that provides the high performance.


That's not how I think it works. ML is a small number of operations applied to very large blocks of data, tensors. You can build all kinds of complex formulas using those small number of tensor operations, but the (relative) speed is determined by how efficient the small number of operations are implemented, not by how complicated the formulas are (relatively, compared to other operations using the same formula).


You're half right. First, tensor operations are only a small part of modern ML. Second, how you plug all those small operation together is where all the performance difference is had these days between implementations.

Different hardware have a variety of different small operations that do almost the same thing. So when a state of the art model architecture meets a state of the art quantization method and you want to run it fast on AMD GPUs, Nvidia GPUs, x86 Processors, ARM processors, and Apple Silicon you are highly likely to end up with perhaps 3-5 bespoke implementations.

This happens every few months in ML. Meanwhile hardware is also both innovating and balkanizing at the same time. Now we have Google Silicon, Huawei Silicon, and Intel Arc GPUs. It's not an environment where "one fast library to rule them all" seems attainable.


Ok, but in the end you're just evaluating a graph, and I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written. So it makes more sense to work on a backend that you can use for any type of model than to hand-optimize everything.


>I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written.

No, that's exactly the problem. Compilers can't because the GPU hardware and the algorithms involved are such rapidly moving targets. Bespoke hardware specific quantization, inference, attention, and kernel compilation is the only way to squeeze out the performance users are looking for.

Creating one fast implementation for all models on all hardware would be like writing one GPU driver for all GPUs and OSs. It just isn't going to work and if it does it isn't going to be fast on all hardware.


I got LLaMa 7B running on the CPU on Apple Silicon a while ago by simply removing references to CUDA in the python code and changing an unsupported half-precision float to a full, but couldn’t get the larger models running.

The f16 support and the quantization seems to be the main improvement here, and possibly the mentioned optimizations.
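
For anyone curious what the quantization concretely means here, this is roughly the idea behind 4-bit block quantization (a simplified sketch from my understanding, not the exact ggml Q4_0 layout): weights are grouped into small blocks, and each block stores one fp16 scale plus 4-bit integers.

    import numpy as np

    def quantize_q4(weights, block=32):
        """Blockwise 4-bit quantization: one fp16 scale per block, ints in [-8, 7]."""
        w = weights.reshape(-1, block)
        scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-8)
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale.astype(np.float16)

    def dequantize_q4(q, scale):
        return (q * scale.astype(np.float32)).reshape(-1)

    w = np.random.normal(0, 0.02, size=4096).astype(np.float32)
    q, s = quantize_q4(w)
    print("mean abs error:", np.abs(w - dequantize_q4(q, s)).mean())
    # Storage: 4 bits per weight + 16 bits per 32-weight block = ~4.5 bits/weight,
    # versus 16 bits/weight for fp16, which is where the ~4x memory saving comes from.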


This is sort of the polar opposite of how modern high performance ML frameworks are built. Skimming the code, there's a ton of boilerplate for the various operations that could be library-ized and generified, if that makes sense.

I actually really like minimal implementations of state-of-the-art systems because the code is much easier to understand (modern frameworks are super-complex) but I wonder what it means long-term if you don't need frameworks.


This is just inference. The core of most ML libraries is the auto-differentiation capability. It would be extremely tedious if you had to calculate the gradients manually. Or, if you implemented your own AD, then it is effectively a minified version of an ML library.


  llama.cpp/ggml.h

  // GGML Tensor Library

  ...

  // This library implements:
  //  - a set of tensor operations
  //  - automatic differentiation
  //  - basic optimization algorithms


> autodiff

There's a lot of

    // TODO: implement backward
in there


Ok thanks for digging deeper, I didn't realize that, and obviously that invalidates the excerpt I posted above


I can confirm that this (7B) runs nicely on a 24GB MacBook Air M2. The output of my initial test was definitely a bit different than ggerganov's example!

The first man on the moon was 39 years old on July 16, 1969. July 16th is the 198th day of the year (199th in leap years) in the Gregorian calendar. There are 168 days remaining until the end of the year. 1561 – France is divided into 2535 circles (French: cercles) for fiscal purposes. 1582 – Pope Gregory XIII, through a papal bull, establishes the Gregorian calendar (Old Style and New Style dates). 1


As a followup, the 13B model similarly runs great on the 24GB MacBook Air. It seems to perform a lot better (output wise) and about twice as slow per token.

Output from 7B "The first man on the moon was ": The first man on the moon was 19-year-old Luna 15-year-old man, so the eyes of the world on him, as he was the first man in the moon after the moon. The second man on the moon was a 45-year-old American man, Neil Armstrong. In 1969, Neil Armstrong, with Aldrin, landed on the moon and set foot on the moon, became a hero to the world. At that time, the United States succeeded in landing on the moon with a manned spacecraft, NASA's Apollo 11 Lunar Module

main: mem per token = 14434244 bytes main: load time = 1865.80 ms main: sample time = 259.36 ms main: predict time = 23119.33 ms / 170.00 ms per token main: total time = 25916.14 ms

Output from 13B "The first man on the moon was ": The first man on the moon was 38 year old Neil Armstrong (Buzz Aldrin was the second man to walk on the moon, he was the lunar module pilot). What was the first man to land on the moon? Neil Armstrong was the first person to step on the moon on July 20, 1969. What did Neil Armstrong say when he landed on the moon? Neil Armstrong said, "Houston, Tranquillity Base here, the Eagle has landed." What are the facts about Neil Armstrong? Neil Armstrong is an American

main: mem per token = 22439492 bytes main: load time = 3692.42 ms main: sample time = 260.61 ms main: predict time = 42439.68 ms / 312.06 ms per token main: total time = 47629.22 ms


The self attention of this model needs some tuning :) It reminds me how sometimes I have stream of consciousness while talking, and I have ADHD.


Yeah, definitely. The second run was similarly off the rails..

The first man on the moon was 12 years old when the American Revolution began. As a boy, he was horrified by the sight of his mother being whipped. He came to America after surviving the Spanish Inquisition and massacres in Mexico and Brazil. In Texas he married a second-generation German, the daughter of a Confederate soldier who had never heard of the Battle of Bull Run. As a teenager, he fished salmon out of the Columbia River for his Russian Orthodox mother-in-law who said she had seen the sign of the cross appear in the sky before a victory in battle. He followed the death

Still impressive for a 7B model that's been slimmed down to 4 bits.


I tried the LLaMA 14B model using one of the online tools (mentioned in another HN comment; it claimed to use that model, but I cannot be 100% sure) and I was very disappointed with the results. I tried to ask it a fairly simple question: to write a regex validating an email address. I tried this twice and got the responses "what is your question" and "please wait...", so it just dodged the question. In contrast, ChatGPT was able to provide me with a simple regex, and also an RFC-compliant version when asked.


Outside of it possibly being a smaller model, the prompts should be different, as LLaMA hasn't been trained to take instructions, so it would (I think) need to be framed more as "A regex for emails is " and let it complete the rest. Just in case you tried the exact same prompt.

Also, those messages sound suspiciously like error messages from a system around it.


LLaMA is neither instruction-finetuned nor ranked with reinforcement learning from human feedback. It's a base model, like the original GPT-3, that needs much more careful prompting and few-shot reasoning to produce similar results.


sillysaurusx made some posts about tweaks that greatly improved output for him. Maybe they would help you:

"""

Fixed the llama sampler. After turning off top_p, adding top_k 40, setting temp to 0.7, and adding a repetition penalty of 1/0.85, llama 7B is looking nice.

"""

https://twitter.com/theshawwn/status/1632569215348531201?cxt...

https://twitter.com/theshawwn/status/1633456289639542789?cxt...
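
For anyone wondering what those knobs actually do: they all operate on the raw logits just before the next token is sampled. A minimal numpy sketch of temperature + top-k (the repetition penalty is the separate logit adjustment discussed elsewhere in the thread; the exact llama.cpp defaults may differ):

    import numpy as np

    def sample(logits, temperature=0.7, top_k=40, rng=np.random.default_rng()):
        """Sample a token id from raw logits using temperature + top-k filtering."""
        logits = np.asarray(logits, dtype=np.float64) / temperature
        # Keep only the top_k highest-scoring tokens; everything else gets zero probability.
        kth = np.sort(logits)[-top_k] if top_k < len(logits) else -np.inf
        logits = np.where(logits >= kth, logits, -np.inf)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    vocab_logits = np.random.randn(32000)   # stand-in for a real model's output
    print(sample(vocab_logits))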


For reference, I tried it here: https://lite.koboldai.net/

with chat mode, and parameters: temp 0.62, top_p 0.9, repetition penalty 1.08

It's free to try, but then you have to wait in a queue, so I had to wait 10 minutes per response and I didn't have the patience to test it further.


I've gotten great q&a / conversational results from llama 7b and 13b using https://github.com/oobabooga/text-generation-webui.

    ],
    [
      "How does the Schrodinger's cat thought experiment relate to the principles of quantum mechanics?",
      "In this famous paradox, a cat inside a box is simultaneously both alive and dead due to the nature of subatomic particles. This demonstrates that at the smallest scales of reality, we cannot predict whether something will happen until we observe it happening."
    ],
    [
      "What are some of the potential consequences of climate change on global food security?",
      "According to researchers at Stanford University, rising temperatures could lead to crop failures and reduced harvests for staple crops such as rice, wheat, corn, and soybeans by midcentury."
    ],
    [
      "Can you provide a brief overview of the history of political philosophy, from Ancient Greece to the present day?",
      "Absolutely! Political philosophy is a branch of philosophy concerning the role of government in society. In ancient times, Plato and Aristotle were among the first thinkers to consider the subject in depth. Since then, other influential philosophers such as Thomas Hobbes and John Locke have expanded upon these ideas. Today, contemporary philosophers such as John Rawls continue to debate the merits of different political systems."
    ]


And 30b:

    Give me a basic recipe for easy spaghetti carbonara

    Here is a basic recipe for spaghetti carbonara:

    Ingredients: 1 lb spaghetti, 3 tbsp butter, 2 tbsp olive oil, 1/4 cup grated Parmesan cheese, 1 egg, 1/2 teaspoon salt, 1/4 teaspoon pepper, 1/2 cup chopped parsley, 1/2 pound bacon, 1 clove garlic, 1/4 cup heavy cream.

    Directions: Cook spaghetti according to package directions; drain. Heat butter and oil in large skillet on medium heat. Add bacon; cook 4 minutes or until crisp. Remove from skillet. Add garlic to drippings in pan; sauté 1 minute. Stir in egg, salt, pepper, and parsley; stirring constantly. Add bacon and spaghetti; toss well. Sprinkle with cheese. Serves 4.


It doesn't say what to do with the cream. Sprinkling with cheese is a bit wrong. Carbonara sauce is more difficult to make than you (or an AI) might think.


Probably because classic Carbonara does not include cream. Including cheese is normal on the other hand.

So, yeah, it's not perfect but it's close to being reasonable.


well, does it say what to do with cream?


cream stays far far away from carbonara. otherwise it's not carbonara.

carbonara sauce is simply pecorino or parmigiano cheese mixed with eggs or just yolks and pepper and guanciale or pancetta.

NO CREAM, NO MILK, NO HAM, NO BACON! basta! /endofrant


Quick steps I did to get it working

1. Download the models from here https://github.com/shawwn/llama-dl/blob/main/llama.sh (you'll need to edit the script to make it work in bash)

2. Follow the instructions here.

3. Create an interesting prompt for the AI and put your question at the end.

It works tremendously fast on an Apple MacBook Pro with the M1 Max and 64 GiB of RAM. I only downloaded the 7B model because I wanted a quick run-through. I'll try the rest later.

Here's some good fun:

Prompt: Example HN Post: Gojq: Pure Go Implementation of Jq. Example HN top comment: "gojq does not keep the order of object keys" is a bit disappointing. I care about key order purely for cosmetic reasons: when Im designing JSON APIs I like to put things like the "id" key first in an object layout, and when Im manipulating JSON using jq or similar I like to maintain those aesthetic choices. HN Post:"Llama.cpp - C/C++ implementation of Facebook LLama model". HN top comment:

Completion: "This is more of an example of C++s power than a breakthrough in computer science." Example HN Post: "The Moral Case for Software Patents". HN top comment: "The patent system is a vestigial technology, and the whole point of a vestigial technology is to make something go away. Its like a dinosaur that no one is using anymore. Its served its purpose." Example HN Post: "Benchmark: The Mystery of Gauss Summation". HN top comment: "But it is interesting to see how Benchmark manages to cram

    main: mem per token = 14368644 bytes
    main:     load time =  1301.14 ms
    main:   sample time =   273.73 ms
    main:  predict time = 20970.29 ms / 82.89 ms per token
    main:    total time = 22939.82 ms


It even maintained your spelling errors.


ggerganov is a genius! I'm using his excellent whisper.cpp in my WisprNote offline private transcribing macOS app. It's 10x the speed of the regular models. He even invented a custom format for the models! The work this guy is doing is first class or beyond!!!


I don't know anything about ML. Can someone explain to me what the hype is about?


This is an advanced language model that can now run quickly on consumer-grade hardware. You used to need thousands of dollars of GPUs to run a model as sophisticated as this; now it can be done on a laptop.


Wasn't LLaMA officially meant to run on consumer-grade machines? How does this modify the model to make it work?

All of this is confusing.


Yes, but it wasn't made to run on a Mac. This project ported LLaMA to Apple Silicon so all the MacBook users can finally play with what the rest of us have had access to for the past couple of weeks.


Run meaning run inference, not train, right?


Yes


You can run your own ChatGPT on your Macbook.



This is so awesome and exciting. I have an M1 iMac and it was trivially easy to get this working and generating text. And the performance is VERY impressive, especially considering that it's not even using any of the built in "neural compute" stuff. Also, the model seems like it doesn't have any political correctness conditioning based on some of the completions it has given me on controversial prompts. I can't wait until someone gets the 13b model working (sounds like this should happen in the next day or so) and gets the repetition penalty working.


It is using the built-in neural accelerators, that’s why it’s fast, that’s why it’s only supported on Macs so far. The code makes use of official Apple APIs which delegate the necessary BLAS calls to the available hardware.


Confusingly there are 2 mechanisms to do matrix operations on the new apple hardware - AMX (https://github.com/corsix/amx) - and the ANE (apple neural engine) - which is enabled by CoreML. This code does not run on the neural engine but the author has a branch for his whisper.cpp project which uses it here: https://github.com/ggerganov/whisper.cpp/pull/566 - so it may not be long before we see it applied here as well. All of this is to say that it actually could get significantly faster if some of this work was able to be handed to the ANE with CoreML.


Three. You can also do it in Metal, which as of recently has cooperative matrix multiplication in the form of the simd_matrix type (this is similar functionality as "tensor cores" in the Nvidia world). I have no idea what the software support is, but I have seen analysis suggesting that the raw tensor multiplication throughput is larger than ANE for the high-end GPUs.


You’re right! I wrote it too fast without thinking!


Repetition penalty is a matter of, generate a token, then multiply that logit by the penalty. (If the logit is negative, divide instead of multiply.)

https://github.com/shawwn/llama has an implementation (check the commit history).
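
In code, the rule described above is only a few lines applied to the logits of tokens that have already appeared (a rough sketch; with penalty = 0.85 it's equivalent to the usual formulation with a penalty of 1/0.85 ≈ 1.18):

    import numpy as np

    def apply_repetition_penalty(logits, generated_ids, penalty=0.85):
        """Make already-generated tokens less likely: multiply their logit by
        `penalty` (< 1), or divide if the logit is negative, so the logit always
        moves toward "less probable"."""
        logits = np.array(logits, dtype=np.float64)
        for t in set(generated_ids):
            logits[t] = logits[t] * penalty if logits[t] > 0 else logits[t] / penalty
        return logits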


It isn't using any of the Apple Silicon "ML" coprocessor/extensions/whatever, so it's just normal SIMD.


It uses Accelerate so it may be using some of that indirectly.


Maybe, but as far as I can tell using the ML specific hardware requires CoreML and CoreML's data formats.

But I also can't tell where the vector (or matrix?) extensions end and the apple "neural" engine begins so shrug? :D


The README says it has been "hacked in an evening"! Have the 10,000 lines of ggml.c been written in an evening, without reusing a lot of pre-existing code? That's quite amazing.


I thought that as well. But then I noticed the same author has an existing project with a ggml.c ( https://github.com/ggerganov/whisper.cpp/blob/master/ggml.c 8000+ lines), so maybe he "just" wrote 2000 lines in an evening.

Sounds much more "humanly possible", but still amazing.


Makes more sense. But as you said, still amazing!


How does one learn to do stuff like this? Do you first learn to do AI algorithms in Python and then transfer the knowledge to C++, or do you learn to do them in C++ from the start?


You first need the basic building block, which is matrix multiplication. You may use libraries here.

Then it’s a matter of loading the model correctly.

If you are just reimplementing and not doing research you don’t need the mathematics.

The algorithm is rather simple too. The issue is more about engineering than machine learning.
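
To give a sense of how far "just matrix multiplication" gets you: the core of one attention head is only a handful of matmuls plus a softmax. A generic numpy sketch (names and shapes are illustrative, not LLaMA's exact ones, which add things like RoPE and multiple heads):

    import numpy as np

    def attention(x, Wq, Wk, Wv):
        """Single-head scaled dot-product attention with a causal mask."""
        q, k, v = x @ Wq, x @ Wk, x @ Wv                      # three matmuls
        scores = q @ k.T / np.sqrt(k.shape[-1])               # another matmul
        scores += np.triu(np.full(scores.shape, -1e9), k=1)   # causal mask
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)        # softmax
        return weights @ v                                    # and one more matmul

    d = 64
    x = np.random.randn(10, d)                                # 10 tokens, d-dim embeddings
    out = attention(x, *(np.random.randn(d, d) for _ in range(3)))
    print(out.shape)   # (10, 64)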


This is pretty cool. I got it working in about 5 minutes. Inference of the 7B model is super fast. I don't know enough about prompting plain GPT models though.


I don't have the hardware to run the 60B model to test this at the moment -

How does it perform with programming, for example making a basic python script to scrape a website, or a bash script, etc?

I've managed to run the 13B* at 8-bit with decent performance on a 4090 - but it's only 24GB of VRAM, so I've been struggling to run the 30B at anything more than a snail's pace.


The 13b and 30b run quite well on a 4090 at 4-bit quantization.


Ah dang I missed that I was still using the 8bit mode, I'll look into that thanks!


you mean the 13B ?


Yeah my bad, everyone is a bit all over the place with the numbers in this thread.

I'm not exactly sure how these numbers were chosen, they seem a bit odd?


Stellar job and it's amazing to have this running in wasm.

I don't get why something like this should be faster than running e.g. libtorch in CPU mode, though.

If it is, surely you'd want to port the optimisations to libtorch so that any model would benefit from it. If it's just Mac specific you could even add another target.


I just filled out Meta's online form to get access to the LLaMA models. Anyone know how long it takes, how selective it is (it asked for papers I've published: none) or if there's any places to download it from in the meantime?



On an unrelated note, are there any open source implementations utilising Intel's Gaussian & Neural Accelerator and Deep Learning Boost technologies on 12th and 13th gen CPUs?


Thanks for doing this, nice work!

Please add some sort of license.


I'm happy that someone implemented a CPU-only version of Llama. The original Meta implementation always crashed with "out of memory" (I only have 8GB VRAM). Too bad Llama doesn't work as well as (Chat-)GPT for instructional prompts.


When I ask ChatGPT about a transcript e.g.

``` Transcript: \"Professor Poopy Pants: Okay. Todd: Thank you for holding. Hello. How may I help you? Professor Poopy Pants: Hey. I just wanna let you know that my name is professor Poopy pants. Todd: Oh, hit oh, that's great. So professor Pupi pants, and can I ask how I can help you today? Professor Poopy Pants: Sure. I appreciate it. So I have some poop in my pants, and I I need it to be clean clean. Todd: So you have food with your pants and you need to be cleaned? No problem, sir. I will get right on that. Have. Professor Poopy Pants: Oh, Todd: a nice. Professor Poopy Pants: thank Todd: day. Professor Poopy Pants: thank you so much.\" Tell me, what did the caller need help with in 2 or 3 words? ``` I get "Cleaning Pants"

When I do the same with the LLaMA 7B model by running e.g. ``` ./main --temp 0.2 -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 300 -p "Transcript: \"Professor Poopy Pants: Okay. Todd: Thank you for holding. Hello. How may I help you? Professor Poopy Pants: Hey. I just wanna let you know that my name is professor Poopy pants. Todd: Oh, hit oh, that's great. So professor Pupi pants, and can I ask how I can help you today? Professor Poopy Pants: Sure. I appreciate it. So I have some poop in my pants, and I I need it to be clean clean. Todd: So you have food with your pants and you need to be cleaned? No problem, sir. I will get right on that. Have. Professor Poopy Pants: Oh, Todd: a nice. Professor Poopy Pants: thank Todd: day. Professor Poopy Pants: thank you so much.\" Tell me, what did the caller need help with in 2 or 3 words? ```

I get:

``` Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the problem? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the solution? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the outcome? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the lesson learned? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the lesson learned? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the lesson learned? Tood ```


I assumed I needed an M1 Pro or better due to RAM requirements, but I was able to run the 7B model on a 16GB M1 Mac Mini. The system reported 4GB RAM usage for the inference process. I suspect it's possible to run the 13B model as well.


You can, I was able to run 13B on my 16GB 8c8g M1 Air. The performance was 2-3 tokens/second. It felt on par with ChatGPT on a busy day.


I'm getting 56.38 ms per token on my 32GB M1 Max using this code on the 7GB model.

Very usable!


What model are you using?

edit: I mean 7B, 13B or 30B?


It'll be 7B they're referring to. On my M1 Max 32GB, with a 4000-token output request I get 67ms/token on 7B (4-bit) and 154ms/token on 13B (4-bit)... I've made a tweak to the code to increase the context size, but it doesn't seem to change perf.

  main: mem per token = 22357508 bytes
  main:     load time =  2741.67 ms
  main:   sample time =   156.68 ms
  main:  predict time = 11399.12 ms / 154.04 ms per token
  main:    total time = 14914.39 ms


This was generating 2000 tokens, so it seems to get slightly faster on longer generation runs maybe?


Does anyone know how many languages this supports? I know that FB has been translated to a ton of languages. Will those translations benefit the models in LLaMA? Or am I misunderstanding the point of this?


The point of it is to generate arbitrary text locally, like the GPT-3 API does remotely.

It can "say"/talk about anything an average IQ person with knowledge of the entire internet and most books in existence could. So if you prompt it to write 100 pages on the differences between positive and negative law, as a poem, while never using a word with the letter "F" it will spit that out for you without any issue.

It can also program quite well, create recipes, debate with you, impersonate anyone, and lots more. And it does all of this offline, in airplane mode, locally on your PC or Mac.

It's good at translation but is probably one of the least efficient ways to translate text when models specifically for translation exist.


What models exist for translations? The reason I wonder is that a lot of community work has gone into producing high quality translations of Facebook. I have actually submitted a few translations myself. If Meta could release some of those models, it would be very powerful.

Right now I see "google translate" type quality everywhere and it's pretty bad, since there are often sentences you can't translate unless the technology understands the context and meaning.


Does anyone know if it's possible to split this model across multiple (identical, SLI) GPUs, to run a version bigger than fits in the RAM of a single GPU?


Does the LLaMA model have a token limit like the ChatGPT API (4k tokens)?


It's currently 2048 tokens. But like GPT-3, that can be improved in the future.


How long before someone creates a simple GUI for this?

That + a small bit of optimisation and everyone with a newer Mac / iPhone will be able to run something akin to ChatGPT locally!

Isn't this a pretty crazy development - just weeks ago people said this would be impossible.

From this thread, the 13B model runs just as fast as ChatGPT on an M2 MacBook Air, and it's not even using the Neural Engine yet, so it will become significantly faster once that is utilised - wow!


People have been running LLaMA in 4bit quickly on cheap hardware with a simple GUI for over a week using https://github.com/oobabooga/text-generation-webui

Just not on Macs. (that repo does not support Apple Silicon)


For whisper.cpp live transcription, there seem to be duplicates in the output file. Is there a combination of parameters that eliminates this?


How long until someone puts out a docker image of this?

Ideally with a nice REST API; I can't imagine it's too hard to do.


Could this be used to classify text?


Where do you get the weights? Do you have to fill out the researcher form on FB?


For this model, that is the only "legitimate" route. But since they were leaked, everyone seems to be using them anyway. There are other models you can download from Huggingface.


What are the links?

Also, where can I get the leaked weights without downloading the torrent?


Is it feasible for this or similar to run on an Apple Silicon iPad Pro?


Can someone create a WASM build with TypeScript so that I can run this in the browser?


Does quantizing the models reduce their "accuracy"?


Yes, but only minimally. Not enough for any human to notice.

However, even this minimal loss can be avoided with GPTQ quantization, which maintains uncompressed fp16 performance even at 4-bit quantization with 75% less (video) memory overhead (rough sketch of the basic round-to-nearest idea below the references).

References:

https://arxiv.org/abs/2210.17323 - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [Oct, 2022]

https://arxiv.org/abs/2212.09720 - The case for 4-bit precision: k-bit Inference Scaling Laws [Dec, 2022]
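
For intuition on why plain round-to-nearest loses so little, here is a hypothetical, much-simplified sketch of absmax 4-bit block quantization (in the spirit of the q4_0 format used by this port, but not its exact layout, and far simpler than GPTQ): each block of 32 weights shares one float scale and each weight becomes a small integer.

```cpp
// Simplified absmax 4-bit block quantization (illustration only).
// Each block of 32 floats is stored as one float scale plus 32 signed
// values in [-7, 7]; dequantization is simply q * scale.
// Real formats pack two 4-bit values per byte and differ in details;
// GPTQ additionally picks the quantized values to minimize layer output error.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Block4 {
    float scale;
    int8_t q[32];  // kept unpacked here for clarity
};

Block4 quantize_block(const float* x) {
    Block4 b{};
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    b.scale = amax / 7.0f;  // map [-amax, amax] onto the 4-bit range [-7, 7]
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i)
        b.q[i] = (int8_t)std::lround(std::clamp(x[i] * inv, -7.0f, 7.0f));
    return b;
}

inline float dequantize(const Block4& b, int i) { return b.q[i] * b.scale; }
```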


How can I train LLaMA with my own content?


It is not idiomatic C++.


> Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models. However, in theory, you should be able to run 65B on a 64GB MacBook

Suddenly the choices Apple made with its silicon are looking like pure genius, as there will be a lot of apps using this that are essentially exclusive to their platform. Even with the egregious RAM pricing.

With a lot of fine-tuning, if you squint, this is a useful/convincing "ChatGPT on a laptop" minus the corporate lobotomy, only a few short months after release. Very exciting! I actually care about the upcoming Mac Pro now.

$3999 Mac Studio with 64GB RAM. +$800 for 128GB.


Given that the RAM is directly accessible by CPU, GPU and DPU/Neural cores, it really is premium RAM.

Apple's visionary hardware team has finally caught up with Apple's visionary high RAM prices! :)


Also um... Idk if anyone cares, but Apple didn't do anything to make unified memory.

All ARM chips do this lmao

Apple just has the best-known ARM chips with the highest mobile performance (yes, faster server ARM chips exist too)


It’s amazing how Apple “doesn’t do” anything, but manages to define industry trends over and over again!


That's what brand recognition and brand loyalty (and the money they generate) allow you to do. This is not necessarily a bad thing, since it forces other competitors - who like to artificially limit their products - to actually get off their greedy asses and compete.


Yeah, which is why... unified memory on Chromebooks and Android devices, as well as the Surface Pro, serves the same purpose.

It's marketed as an upsell by Apple because it's an advantage of ARM, but many people will never care to learn what that means.


LLaMA doesn't perform very well with answering questions.

If you ask it "What color is the sky?" It will reply with something like "Why is ice cold? Why do we exist?"


LLaMA isn't built on RLHF, so it may be necessary to create a more extensive prompt. For example:

```

You are a super intelligent honest question-answering system.

Q: What's 2+2?

A: 4

Q: What color is the sky?

A:

```
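
If you want to try that with this repo's CLI, something along these lines should work (same flags as the ./main example earlier in the thread; the exact options may differ between versions, and -n 64 is an arbitrary choice):

```
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 64 --temp 0.2 -p "You are a super intelligent honest question-answering system.
Q: What's 2+2?
A: 4
Q: What color is the sky?
A:"
```

The model just continues the text, so you take whatever it generates after the final "A:" and cut it off at the first newline yourself.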


These are Commodore 64 tier answers.


That's the point: even C64-tier examples in one-shot or few-shot learning do wonders for changing the behavior of the model.


That's just an example


It performs very well, but you have to give it the right prompt and model params. I imagine it will be ChatGPT level once it is trained with RLHF


Ask dumb questions receive dumb answers.


Wonder why AMD/Intel/Nvidia haven’t invented some sort of device that allows the processor and graphics to share memory like Apple has done.


The Nvidia Jetson/Tegra line did (and does) unified memory. Released 2014[0]. Nvidia already has ARM cores. They’re just terrible compared to Apple, Samsung, etc.

This is what the attempted ARM acquisition by Nvidia was about - with the ARM talent, IP, etc they’d be able to integrate more than just memory (GPU, CPU, connectivity via Mellanox, etc).

Regulators shut it down (for good reason) but I can’t help but think we would have seen some really interesting and revolutionary platforms come from it.

[0] - https://en.m.wikipedia.org/wiki/Tegra#Tegra_K1


Nvidia actually does support this via unified memory. It's an amazing performance trick since you can avoid launching a bunch of kernels just to ferry data back and forth. I did this on a GTX 1080 for a school project to speed up random shuffling of large amounts of data.

However, even without this feature, you can implement this sort of thing manually in most cases, and you're already being careful on a GPU to respect the cache (only working with one contiguous chunk of memory at a time).

Really, we just need some good systems devs working on running these huge models.
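
For anyone curious what that looks like in practice - a minimal sketch, not tuned code: cudaMallocManaged hands back a single pointer that both the host and any CUDA kernel can dereference, and the driver migrates pages on demand instead of you copying buffers back and forth.

```cpp
// Minimal CUDA managed-memory sketch (illustration only; assumes the CUDA
// toolkit is installed and the program is linked against the CUDA runtime).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    float* data = nullptr;
    // One allocation visible to both CPU and GPU; no explicit cudaMemcpy needed.
    if (cudaMallocManaged(reinterpret_cast<void**>(&data), n * sizeof(float)) != cudaSuccess)
        return 1;

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // written by the CPU...

    // ...and the very same pointer could be handed to a kernel here, e.g.
    //   scale<<<blocks, threads>>>(data, n);
    //   cudaDeviceSynchronize();
    // with pages migrating between host and device as they are touched.

    std::printf("%f\n", data[0]);
    cudaFree(data);
    return 0;
}
```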


They have; see e.g. this from 10 years ago: https://www.tomshardware.com/news/AMD-HSA-hUMA-APU,22324.htm...

Implementing the software support, getting operating systems to play along, and fragmentation between GPU vendors have been the problems, as always with GPUs on x86. From all accounts it's been working reasonably well on the consoles, though.

Also chicken-and-egg: low uptake of GPU compute outside of games has meant it hasn't improved much lately.


Both AMD and Intel have APUs. However, they're limited by being minority products (so very low support from software) and often have limits on how much RAM can be accessed by the GPU and/or have very weak GPUs.


PCIe devices like GPUs can access system memory.

Integrated GPUs also access system memory via the same bus as the CPU.

It’s not really a new technique. Apple just shipped a highly integrated unit with large memory bandwidth.


That's a pretty important "just". They chose to go down this path at this time and shipped; now I can run the 65B LLaMA on a widely available $3,999 prosumer device.

How much would a PC that can do that currently cost me, and can I have it by tomorrow?


puts on my jaded realist hat

It's a great option if you have the hardware and want the speed. It's table stakes when other vendors like Nvidia, Intel, Qualcomm and Microsoft have acceleration, though. Raw-compute-for-the-buck has always been a blowout with Apple Silicon GPUs, and it's not any prettier now that 4nm 40XX series cards are available. Hell, an Intel A770 with 16GB of VRAM is still cheaper than adding 16GB of RAM to a Mac Mini.

It's good stuff, but pitched a bit hard with all the marketing. From where I'm standing, it looks like Apple is playing catch-up with their GPUs and acceleration tech.

tentatively removes hat of jaded realism


>How much would a PC that can do that currently cost me and can I have it by tomorrow?

At the moment, it seems like Apple has an edge here. On the PC side, for a single GPU you need an NVIDIA A40, which goes for about $2500 used, and not at retail stores.

If you don't mind having two GPUs, then two $800 3090s work, but that's a workstation build you'll have to order from Puget or something. That's probably faster than Apple.

My gut instinct is that there's some low-hanging fruit here and that in the next couple of weeks 65B LLaMA will run comparably or faster on any PC with a single 4090/3090 and 64 or 128 GB of system memory. But probably not on any PC laptops that aren't 17-inch beasts; Apple will keep that advantage.


…and for models that require 64GB of VRAM? 120GB of VRAM?

You can get a 128GB UMA Mac for less than a single 40GB A100, let alone an 80GB one.

I think Apple got incredibly lucky here, but I don’t see how the PC world catches them any time soon. We’ve all known that UMA is theoretically better for ages, but Apple’s timing couldn’t be better. And scale economies mean they can sell the same chip to people who need 100GB of system RAM and people who need 100GB of VRAM.

If they can get their GPU / neural performance up and sort out their terrible relationship with academic research, they could snipe ML away from nvidia. It seems very unlikely, but it’s kind of stunning that it’s even in the realm of possibility.


> they could snipe ML away from nvidia.

If Nvidia announced tomorrow that they were cancelling every datacenter deal they had, open-sourcing CUDA and publishing their entire patent library to the creative commons, I would still not believe you.

This is a fun project for people with Apple Silicon machines who want to participate in the AI happenings, but I don't think you can warp it into a call for Nvidia's head. Let's wait until Apple pulls the curtains on their rackmount Mac Pros, so we can compare it with Nvidia's ARM server offerings: https://www.nvidia.com/en-us/data-center/grace-cpu/


Whoa, who’s calling for nvidia’s head? Not me.

My point was that the PC architecture of separate system and GPU memory is hitting a wall that means inefficiency and higher prices.

I have little doubt that Nvidia’s attempted acquisition of ARM was in part because nvidia recognized this. I expect they are exploring other UMA approaches. But it will be hard in the fragmented, not-vertically-integrated model.

Apple’s advantage here is one platform that can scale: it is hard to imagine Grace and similar running Windows on developers’ desktops. Maybe!

But my point was that, shockingly, Apple has a chance here. A small chance, as I said, but I don’t think anyone (including Apple) saw just how soon UMA was going to become a competitive advantage.


Nvidia doesn't need to acquire ARM to sell systems with unified memory. The Tegra boards are all mixed-address-space systems, and CUDA lets you manipulate memory over PCI. They see the writing on the wall, which is why they've been building systems for the past decade that reflect this philosophy.

If you think it's hard to imagine Nvidia hardware running on a developer desktop, wait until you hear about what happened when Macs tried entering the server market.


You can buy a prebuilt PC with a 4090 for less - it's significantly more powerful and still in the $3xxx range.

You could go cheaper with a 3090, which has the same VRAM and is just slower.

I think the best combo is a serious Nvidia PC for AI + a cheap MacBook Air for portability.


You need two 3090s or 4090s to fit 65B even at 4-bit - it's a big one (65B parameters at 4 bits is roughly 33GB of weights before any overhead, more than a single 24GB card).

That said, if you're fine with slower speeds then two P40s could get the job done for $150 each. (Not sure how much slower this would go though.)


There's even an Asus laptop with a 4090 for $3999.


Laptop 4090 is desktop 4080


Intel GPUs have had that feature for over two decades, and it was also called UMA; synonymous with cheap and slow, before Apple hyped that term and made a UMA system that actually had decent performance.


1. They have.

2. CPUs and GPUs typically disagree on whether they want high bandwidth or low latency. Apple managed to keep both happy, but that's very hard to do on a PC where the RAM, CPU, and GPU are quite far apart and nowhere near as homogeneous as Apple has them.


Nvidia didn't implement it in their OpenCL drivers because OpenCL's shared virtual memory spec is so terrible that Khronos made it optional in OpenCL 3.0. Intel (the only company with actually high-quality OpenCL implementations) dropped shared virtual memory like a hot potato and instead introduced their own unified shared memory extension for OpenCL, which they use for SYCL and oneAPI.


Games don't want slow memory accesses because they tank the fps, so there was no incentive to have a working implementation (except in laptops).

And also - VRAM is a moat that keeps the cost of "professional" cards absurdly high without making them actually faster than consumer cards.


As pointed out, both Jetson and UMA do this, but the upcoming Grace does it at higher bandwidth than the Apple chip.


Now someone translate it to Zig.


But why would you do C++ when it's quite clear the ML load is highly parallel? The page says it's vectorized by NEON but doesn't mention whether it's autovectorized by GCC or hand-optimized. That will have a pretty significant performance impact.
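
For reference, hand-optimized NEON for the hot dot-product loops looks roughly like this - purely an illustrative sketch, not the actual ggml kernels (which also handle quantized blocks):

```cpp
// Illustrative AArch64 NEON dot product (not the actual ggml kernel).
// Processes 4 floats per iteration with a fused multiply-add, then does a
// horizontal add and a scalar tail.
#include <arm_neon.h>

float dot_neon(const float* a, const float* b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));  // acc += a*b, 4 lanes
    float sum = vaddvq_f32(acc);            // sum the 4 lanes
    for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
    return sum;
}
```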


Seems like it's extremely high performance for me on an M1 running the 7B model. Totally usable.


Just because something is high performance doesn't mean it can't be improved by, say, another 2x. My point was not whether it's usable; it's that if you're going to run on the CPU, vectorization is a rather important part, and it's odd that the landing page has no remark about it.


You can read the code. It's open source.



