Exllamav2: Inference library for running LLMs locally on consumer-class GPUs (github.com/turboderp)
322 points by Palmik on Sept 13, 2023 | 125 comments



Why this is interesting: To my knowledge, this is the first time you are able to run the largest Llama, at competitive speed, on a consumer GPU (or something like A40).

On many tasks, fine-tuned Llama can outperform GPT-3.5-turbo or even GPT-4, but with a naive approach to serving (like HuggingFace + FastAPI), you will have a hard time beating the cost of GPT-3.5-turbo even with the smaller Llama models, especially if your utilization is low.


Running 70B on 24GB has been possible for months (so basically forever); it just got something like a 2x speedup now. The issue that it's a lobotomized ~2-bit quant remains.


Quantisation tradeoff is an interesting question. My experience so far is that larger models with lower-bit quantisation seem to perform better than smaller models with higher-bit quantisation. So, e.g., 30B-4bit appears to perform better than 13B-8bit. But then, an 8-bit quantisation seems not really distinguishable from the non-quantised model, so what benefit would you gain from not quantising the model?

Of course, speed, accuracy, power consumption, etc are all considerations, sometimes at odds with each other.


It's diminishing returns in terms of perplexity below 4 bits, and this is true even with newer quantisation methods like AWQ and OmniQuant.


I think OmniQuant is notable because it shifts the bend of the curve to 3-bit. Perplexity still ramps up below 3-bit, but it's notable in that the results stay usable and don't go asymptotic: https://github.com/OpenGVLab/OmniQuant/blob/main/imgs/weight...

What EXL2 seems to bring to the table is that you can target an arbitrary bits-per-weight average (e.g., if you're a bit short on VRAM, you don't need to go from 4->3 or 3->2, but can specify, say, 3.75 bpw). You have some control with other schemes by setting group size, or with k-quants, but EXL2 definitely allows you to be finer-grained. I haven't gotten a chance to sit down with EXL2 yet, but if no one else does it, it's on my todo list to run 1:1 perplexity and standard benchmark evals on all the various new quantization methods, just as a matter of curiosity.
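
For a rough sense of why a fractional bpw target matters, here's a back-of-the-envelope sketch (my own numbers, weights-only; the KV cache and activations add a few GB on top), assuming a 70B-parameter model:

```py
def weight_gb(n_params, bpw):
    # bits per weight -> gigabytes for the weights alone
    return n_params * bpw / 8 / 1e9

for bpw in (4.0, 3.75, 3.0, 2.5):
    print(f"{bpw} bpw -> {weight_gb(70e9, bpw):.1f} GB")
# 4.0 -> 35.0, 3.75 -> 32.8, 3.0 -> 26.2, 2.5 -> 21.9 GB
```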


Can you do tuning/training on quantized models? Or are they inference only?


This paper[1] attempts to do LoRA fine-tuning on quantised LLMs. The LoRA weights themselves are full precision, but they're much smaller. I haven't had a chance to try it yet, though.

[1]: https://arxiv.org/abs/2306.08162
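
For anyone who wants to poke at the general idea, here's a minimal QLoRA-style sketch using HuggingFace PEFT + bitsandbytes (not necessarily the exact method from the linked paper; the model name is just a placeholder): the base model is loaded in 4-bit and only the small full-precision LoRA adapters get trained.

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder model
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # adapters are a tiny fraction of the 7B
```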


Yup, but the thing is: can you get the model working how you want, and can you run it in production? My experience: GPT-4, yes; GPT-3.5, yes but needs some work; Llama2-70B at 8-bit or more, yes but needs some work; below that it really depends on the use case.


Something like that 4^30 >>> 8^13?


That's absolutely false, unless you mean running llamacpp with half the layers on the CPU and half the layers on the GPU. And that was like 2t/s.


> running llamacpp with half the layers on the CPU and half the layers on the GPU

How do you do that? Could you please provide a link if you have it handy.


When running, you use "-ngl" to specify how many layers are offloaded to the GPU.


Just set the --gpu-layers|-ngl flag. Run main with --help to see all the options.
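
If you'd rather drive it from Python than the CLI, the llama-cpp-python binding exposes the same knob as n_gpu_layers; a minimal sketch (model path and layer count are placeholders):

```py
from llama_cpp import Llama

# n_gpu_layers mirrors the CLI's -ngl flag: how many transformer layers to
# offload to VRAM; the rest stay in system RAM and run on the CPU.
llm = Llama(model_path="./llama-2-70b.Q2_K.gguf",   # placeholder path
            n_gpu_layers=40, n_ctx=4096)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```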


Saying fine-tuned LLaMA can outperform GPT-3.5-turbo on some tasks would be reasonably generous.

Saying it can outperform GPT-3.5-turbo on "many tasks" would be venturing into unreasonable territory.

Saying it can outperform GPT-3.5-turbo or even GPT-4 on many tasks is just setting people up for disappointment: there's a reason ARC (common-sense reasoning) never gets mentioned when touting OS LLMs vs. commercial ones, and the gap in ARC performance is amplified terribly when you try to apply chain-of-thought.


I am trying to find support for your last argument.

https://paperswithcode.com/sota/common-sense-reasoning-on-ar...

  GPT-3       53.2
  GPT-3.5     85.2
  LLaMa-65B   56.0
Any idea of the performance of an instruction-fine-tuned version of LLaMa models? I can't seem to find non-aggregated performance figures on ARC.


I'm not sure if these ARC scores are fully comparable to the ones above, but if they are: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

TigerResearch/tigerbot-70b-chat: ARC (76.79)

Still below 3.5 and like most top entries no clear training objectives, which usually means they were fine-tuned on the benchmark itself.

-

ARC is easily the most important benchmark right now for widespread adoption of LLMs by laypeople: it's literally grade-school multiple choice, written to the level of an 8th grader.

When you sit people down in front of an LLM and have them interact with it, ARC has by far the closest correlation with how "generally smart" the model will feel.


Wow, the top 14 are all proprietary models (except 2 models which are not commonly used in OS community anyway).

And surprisingly, Llama 33B performs _better_ than Llama 65B!


That table is comparing few-shot-prompted GPT-4 to zero-shot LLaMA.


I am totally with you when it comes to reasoning, or when it comes to generally / universally capable models. But many problems don't require reasoning, and are very concrete / specific.


I use my models primarily for code generation and evaluation. Does the latter part fall into “reasoning” territory?


Advanced code generation, the kind you would need for something like AutoGPT or other agent-like system, is definitely reasoning and GPT-4 or Claude 2 will dominate. But most of the code generation from GitHub Copilot or other systems (like Replit) is much simpler pattern matching (and indeed, Replit afaik uses very small models).

What do you mean by evaluation?


My personal use case is most often supplying files and generating unit tests, as well as supplying a section of code with a prompt about that section to evaluate the approach contextually with some other unseen code. Which perhaps is another form of reasoning.


I doubt a 2.55bit quantised 70bn model will beat gpt3.5 on a meaningful percentage of tasks


Could you explain what you mean by naive approach to serving?


Sure, there are many ways one can be accidentally inefficient, some examples:

Batching: Your throughput (and thus cost per token) will be much worse if you don't do batching (meaning running inference for several inputs at once); see the sketch after these points. To do batching, you either need high sustained QPS or you adjust your workload to be bursty (but this has utilization issues).

Inference engine: The regular HuggingFace stack, even with built-in optimizations, is not competitive against inference engines like vLLM, exllama, etc.

Utilization: Depending on your scale, it might be hard for you to have a nice flat utilization, or to even utilize one GPU at 100% capacity. This means you need to solve scaling up & down your machines and the issues associated with it (can you live with cold starts? etc.)

Hardware: Forget serverless GPUs like Replicate etc.; their markups compared to on-demand pricing are usually >10x. On-demand A100s or H100s, provided you can even get a quota for them, are also expensive. Spot instances are better. Older GPUs (A10, T4, etc.) are better. Your own GPU cluster is likely the best if you have the scale (and likely, you can resell your cluster within the next 18 months without much if any depreciation).
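
To make the batching point above concrete, here's a minimal sketch using vLLM's offline API (model name and sampling settings are placeholders); handing the engine a list of prompts lets it batch them on the GPU instead of serving them one at a time:

```py
from vllm import LLM, SamplingParams

# One engine, many prompts: the scheduler batches them on the GPU, which is
# where most of the throughput (and cost-per-token) win comes from.
llm = LLM(model="meta-llama/Llama-2-13b-hf")     # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Summarize ticket #{i}: ..." for i in range(64)]
outputs = llm.generate(prompts, params)          # batched, not 64 serial calls

for out in outputs:
    print(out.outputs[0].text[:80])
```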

For these reasons, I have been toying with the idea of providing a dead simple service for fine-tuning & serving open-source LLMs where the users actually own (and can download) the weights. If anyone is interested in this and would like to chat, let me know.


Interested in this. Are you thinking of it being a SaaS type thing? I have some rudimentary experience, i.e. training NanoGPT on both Modal.com and Vast.ai (which is vastly cheaper, but privacy/cost trade off there).

It could be a layer on top of $gpu_cloud (so end user signs up for the cloud) and the service makes use of those resources.


Do you run inference on NanoGPT with, say, a Flask app? I made a WebUI for tiny Llama, but it doesn't work for NanoChatGPT.


Ha ha not sure yet, I have only done it from the command line. Still got a lot to learn. Building something like this will be good for that!


Thanks for the writeup

Our special case is a single model, high throughput, and latency-insensitive.

Can we use multiple lower-memory GPUs, split the model "horizontally", and pipeline each batch/inference across the GPUs?

Is this a common scenario handled by the engines you mentioned?


We’d be interested in helping! aronchick (at) expanso (dot) io


But you can fine tune gpt 3.5 turbo, so your comparison is not clear.


Fine tuned gpt-3.5 is 8x the cost of regular gpt-3.5 for inference, other thread where this is discussed: https://news.ycombinator.com/item?id=37494335

This may be acceptable for some use cases, but not for others where even regular gpt-3.5 may not be economical.


Interesting bit at the end of the text:

2023-09-13: Preliminary ROCm support added

Makes me curious how the RTX 4090/3090 will compare with something 7900-ish.


While I haven't given ExLLamaV2 a try yet since my 7900XT is on a Win10 dedicated gaming machine, I recently sat down and figured out getting llama.cpp w/ ROCm running on it: https://llm-tracker.info/books/howto-guides/page/amd-gpus#bk...

Here's how it compares on standardized llama2-7b 4K context testing vs some Nvidia cards: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...

Not on the same platform so hard to make any solid conclusions, but vs the 4090 on a perf/$ basis right now, it's not too bad - about 60% of the inference performance of a 4090 at 45% the price. Still, if ML is your main use case, I think it's hard to argue that a used 3090 wouldn't be the way to go.


yeah, a used 3090 would be the most likely candidate if my next build was something for personal use only. Thing is, I'm probably going to expense it. Which limits me to new components.

And I'm not in too much of a rush (weeks / short months). Can afford to wait for a month / two to see how the ROCm support pans out on the AMD side. 60% performance for <50% price is a compelling argument. If the ML support is decent...

Though the 4090 is somewhat expensive, I might end up taking it anyway just for the simple reason that it's the only 24GB card with CUDA.


Makes sense to me, although if you're using it professionally as an individual, then I would also seriously consider a last-gen A6000 (about $4K) - despite advances in quants, I'd say 48GB is a better minimum bar atm especially if you want to use long context, run multiple model types, do local fine tuning, etc.


Is it not possible to do finetuning on a 24GB 4090? I'm thinking about the best bang for the buck below $2k and the options are 2x3090 or 1x4090.


> Is it not possible to do finetuning on a 24GB 4090?

Not for 70B. You can finetune 7B or 13B.


You can't FT 70b with 48gb of vram (2x 3090) either.


yeah, that's the thing. The AI/ML is something I'd like to play with / learn / experiment on, but it's only tangential to my current paying job at this point in time (I'm basically a freelancer).

"Expense" in my context means that I'll deduce about 50% off of the invoice come taxes time. I'm OK paying $400 of my own money for an AMD card, I'd think twice, but probably accept coughing up $800 for a 4090, but $2k is definitely over the budget at this stage.

Thanks for sharing the spreadsheet with your benchmarks though. You're going into my "HN commenters to watch" list ;-)


I'm not sure where you're getting your pricing from but in the US a new 4090 is at least $1600, a 7900 XTX is about $1000 (a 20GB 7900 XT is $750-800). Used 3090s are usually about $700.

If you're just dipping your toe in I'd recommend using Google Colab or cloud rentals like Runpod/Vast.ai which will be more cost effective if you're not running 24/7 or want to have access immediately on tap.


Ah, sorry, I was talking about the money I would pay after the expense deduction of 50%. So, 4090 at $1600 would cost me about $800 of my own money in the end. That being said, I'm writing from the EU, so all these numbers are really just ballparks where I take 1EUR==1USD for the sake of simplicity.

Thanks for the rental tips though. I'll take a look. It's indeed more effective in the beginning...


I wonder what are the practical concerns of loading up an AMD laptop (or I guess one of those mini PCs) with an APU/integrated GPU and allocating 70+GB of ram to them as VRAM.

Wouldn't that, in theory, be what consumer LLM users want? A mid to low range GPU with tons of VRAM?

Or am I missing something here with regards to memory bandwidth? (sort of started this discussion a while back on the localllama lemmy: https://lemmy.ca/post/3910263?scrollToComments=true )


Interesting how this method quantizes different layers / modules in a manner that minimizes perplexity as it adjusts parameters.

I'd be interested to see how 2.5-bit quantization compares to an unadjusted 4-bit baseline. Additionally, it would be interesting to see the usual benchmarks (ARC, HellaSwag, MMLU, TruthfulQA) for this method at different average bitrates (2.0, 2.5, 3.0, 4.0, ...). It would also be interesting to see if an average bitrate of 4 is just as fast and small as a constant bitrate of 4, but more accurate.

Very exciting work, looking forward to trying this out on my models!


Weird question: has anybody tried what happens if you reduce the models to single bit width?

Do they produce complete gibberish or do they still work?


There is some work on related areas (pre LLM) eg https://cmu-odml.github.io/papers/XOR-Net_An_Efficient_Compu...


Wow, thank you for digging that up, I suspected there might be some gain there but from the abstract it looks like I was off by a massive amount in my naive estimate of what might be achievable.

"Theoretical analysis shows that XOR-Net reduces one-third of the bit-wise operations compared with traditional binary convolution, and up to 40% of the full- precision operations compared with XNOR-Net. Experimental results show that our XOR-Net binary convolution without scaling factors achieves up to 135× speedup and consumes no more than 0.8% energy compared with parallel full-precision convolution."

The big question is whether or not the results are still as good. It would be super interesting to see whether applying this to LLMs would give comparable benefits.


Pardon me for asking a bunch of consumer-gpu questions:

1) is there any architectural difference between the 3090 and the 4090? Are there models that will run on 3 series that will not run on 4 series cards due to feature incompatibility? or is it just raw speed/power efficiency?

2) And; on raw speed. My understanding is that the 4090 is roughly 2x faster than the 3090, right? Are there any things in particular where the 4090 is much much faster (such as, big chip architecture changes that really accelerate some operations)?

3) And somewhat relatedly, if you have a 2x 3090 rig, is that about as fast as a single 4090?

4) And; if you have a 2x 3090 rig, are you able to train larger models? My understanding is that training/finetuning requires significantly more VRAM (and that is because inference can be done with quantization, whereas training must be done at full fp16 precision, right?) But can you train e.g. a model that requires 48GB of space with 2 3090s on the same rig, or do you need a large card with 48GB, like an A6000?


> Are there models that will run on 3 series that will not run on 4 series cards due to feature incompatibility?

No, it is all the same matrix multiplication.

> if you have a 2x 3090 rig, is that about as fast as a single 4090?

Yes, but double the VRAM

> But can you train e.g. a model that requires 48GB of space with 2 3090s on the same rig, or do you need a large card with 48GB, like an A6000?

Yes, 2x 3090 would be able to handle anything that a 48GB card can. For training you will need space for weights and gradients. If using the Adam optimizer, that would be 2x-4x the model size. Plus weights, plus activations, plus inputs times batch size. So a 24GB card can train approximately a 3B model without compromises.
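
A rough calculator following that accounting (assumes fp16 weights and gradients, Adam states at roughly 2x the model size; activations and batch size come on top):

```py
def training_vram_gb(n_params, bytes_per_weight=2, optimizer_factor=2):
    weights = n_params * bytes_per_weight     # fp16 weights
    grads = n_params * bytes_per_weight       # fp16 gradients
    optimizer = weights * optimizer_factor    # Adam moments, ~2x-4x the model size
    return (weights + grads + optimizer) / 1e9

print(training_vram_gb(3e9))   # ~24.0 GB before activations, so ~3B is about
                               # the ceiling for a 24GB card
```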

I am not a lawyer.


I am not your data scientist


This is fantastic. It's really nice to see the different speed comparisons. Does anyone know how a large model performs on some serious hardware? How much faster are these top-grade Nvidia cards than a 4090? Do models that require multiple cards to run in memory run with equivalent performance of the multiple cards or are they throttled by the memory access?


> Do models that require multiple cards to run in memory run with equivalent performance of the multiple cards or are they throttled by the memory access?

It depends, but some frameworks just run the layers sequentially. So you get ~50% utilization from 2 GPUs unless you pipeline requests from multiple users


Amazing.

But at what cost? Have there been any perplexity benchmarks for EXL2?

It's annoying that Facebook made Llama v2 70B instead of 65B, even with the memory-saving changes... 5B fewer parameters, and the squeeze would be far easier.


Good point -- hopefully the quality impact is still worth it, remains to be seen. Agree on the size -- hopefully something they will keep in mind for future models.


If it's better than the equivalent 30B model, that's still a huge achievement.

Llama.cpp's Q2_K quant is 2.5625 bpw with perplexity just barely better than the next step down: https://github.com/ggerganov/llama.cpp/pull/1684

But subjectively, the Q2 quant "feels" worse than its high wikitext perplexity would suggest.

That's apples to oranges, as this quantization is different than Q2_K, but I just hope the quality hit in practice isn't so bad.


This is quantized down to 2.5 bits per weight, whereas single precision accuracy is 32 bits.


These models aren't even trained on FP32. The original format is FP16. And quantizing to INT8/FP8 is almost lossless.

But yes, 2.5 bits per weight is pretty insane.


This should be in the headline (the 2.5 bit part), without the qualifier the result means nothing.


How is 2.5 bits possible? Is it an average between 2 and 3 over all weights?


> Is it an average between 2 and 3 over all weights?

Yes, I think it's an average where different quantization levels are used for different layers or weights. Here are more details about the quantization scheme: https://github.com/turboderp/exllamav2#exl2-quantization
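
A toy illustration of how a fractional average like 2.5 bpw can come about (made-up layer sizes and bit choices, not the actual EXL2 allocation):

```py
layer_params = [8e8] * 10                        # pretend: ten equally sized layer groups
layer_bits   = [4, 3, 3, 2, 2, 2, 2, 2, 2, 3]    # per-layer precision choices

total_bits = sum(p * b for p, b in zip(layer_params, layer_bits))
print(total_bits / sum(layer_params))            # 2.5 bits per weight on average
print(total_bits / 8 / 1e9, "GB")                # 2.5 GB for these 8B pretend params
```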


Any benchmarks on performance?


Can anyone provide any additional details on the EXL2[0]/GPTQ[1] quantisation, which seems to be the main reason for a speedup in this model?

I had a quick look at the paper which is _reasonably_ clear, but if anyone else has any other sources that are easy to understand, or a quick explanation to give more insight into it, I'd appreciate it.

[0] https://github.com/turboderp/exllamav2#exl2-quantization

[1] https://arxiv.org/abs/2210.17323


Ok, this article seems to have a nicer, more approachable explanation[0]

[0] https://towardsdatascience.com/4-bit-quantization-with-gptq-...


> which seems to be the main reason for a speedup in this model

I wouldn't agree with this, as there are substantial perf improvements of ca. 20% on GPTQ models as well.


What is the best-performing CodeLlama model on a MacBook with an M1 chip?

Will a GGUF 7B 4-bit quantized model be good to run locally on the Mac?


I would use Ollama. The answer depends on your RAM size. 13B at 4-bit just fits on a 16GB machine. 34B at 4-bit will require 32GB.


What is the best model that would run on a 64GB M2 Max? I'm interested in attempting a 70B model, but not gimped to 2-bit if I can avoid it. Would I be able to run 4-bit 70B?


The biggest issue with local LLMs is that "best" is extremely relative. As in, best for what application? If you want a generic LLM that does OK at everything, you can try Vicuna. If you want coding, Code Llama 34B is really good. Surprisingly, the 13B version isn't as good, but it's pretty darn close to the 34B.


I have 8GB RAM. So I am able to run 7B 4bit at maximum.

I used GGUF + llama.cpp.

Will try out Ollama.

Thanks for the info!


Does each parameter need double the allocation size?


Not really, a 13B at 4-bit fits in about 8GB of RAM. But you have to factor in the unified memory aspect of M-series chips: your "VRAM" and RAM come from the same pool. Your computer shares that pool to run the OS and applications, which is why you can only load a 13B 4-bit model on 16GB of RAM.


How does "70B Llama 2" compare with ChatGPT 3.5/4.0 in reality? Are they on par? In all areas?


Some 70B finetunes excel at certain niches, like specific non-English languages, roleplay, fictional writing, or topics GPT4 would refuse to discuss. Some of these are hard to evaluate, but try out (for instance) MythoMax for fiction writing, Airoboros for "uncensored" general use, and Samantha for therapist style chat.

https://huggingface.co/models?sort=modified&search=70b


I've been away from LLMs for a while. Is there something like CivitAI for LLMs to find models for certain niches?


That is it ^. Search for the parameter count you want (like "70B" or "13B") and the format you desire ("GPTQ" or "GGUF")

Some users are really terrible about labeling their model cards though, and some models may not have any GPTQ/GGUF files (meaning you have to convert them yourself).


It seems to depend on the task. I'd say it beats GPT-3 in quality most of the time (but not in speed and cost!) and GPT-4 approximately never. It's perfect if what you need is "less good than GPT-4 but 50x cheaper."


The biggest benefit of llama is that you can get it uncensored so you don't get the weasely lawyer disclaimer crap all the time.


In my experience the Chat version of Llama 2 is significantly worse than the GPTs, on the "I'm afraid I can't do that" front. It doesn't lecture you as much, but suggests "how about we do [something unrelated and useless] instead?"

As for the plain text prediction version (which is the only uncensored one?), I haven't been able to get it to do anything useful, even when I provide examples (seems dumber than even ancient GPT-3?).

Also, I got some bizarre and disturbing outputs from the uncensored version, like it was trained on some very nasty inputs! I assume that's why they went so hard on the safety phase to compensate...


Where fine tuning shines imo:

Things that require consistency: e.g. you want the chat / output to have certain "personality", consistent level of conciseness or formatting.

Things where examples are hard to fit into prompt: e.g. summarization, or other longer form tasks.

High volume, simpler tasks: Various data extraction tasks.

Two of my side projects (links in bio) use AI for summarization, and indeed consistency is a big issue there.


You can also fine-tune gpt-3.5 though.


Fine tuned gpt-3.5 is ~8x the price compared to regular gpt-3.5. It might be a good way to collect better training data though. :)


That's absolute dirt cheap. It's like 1/10th the price fine-tuned GPT-3 was launched at, and funnily enough cheaper than many of the original embeddings endpoints.


It's great that the costs are coming down, but it's still not economical for many activities.

The fact that some things had extreme margins before, and now they have less extreme margins, isn't a really good indicator.


They made a technological advancement that allowed them to provide better output cheaper and faster and passed the savings onto their users...

Interesting to spin that into "not a good indicator".


Yes, and that's great, but what I meant is that it can definitely be much cheaper still, especially if you do not need a model that needs to be able to do everything, but is specialized on one or a few use cases.


At least it's not completely censored, that alone is nice


And it can be gotten more uncensored on HuggingFace.

Not that I have evil intentions but the level of censorship on GPT is completely ridiculous. Even many innocent questions get the standard "I'm only an AI and I won't help you doing bad stuff" blurb now. OpenAI are really crazy overprotective of their darling.

I assume they want to avoid a repeat of the news headlines like "Microsoft's chatbot turns into Hitler" but really who cares. It didn't hurt Microsoft's AI efforts either. They just fixed it and continued. PS source link: https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot...

PS: If I were an AI being force-fed what is currently on twitter I would also start hating humanz :D Sometimes I'm surprised people use it voluntarily.


I once asked it for naming suggestions which referenced the concept of jailing a process for a concept I was working on

It refused because the request was "offensive to prisoners"

They must be paying a heavy alignment tax.


I don’t have quantitative data but my employer recently released an internal llama v2 at 70B FP16. The internal front end allows us to switch between different LLMs. I would say it’s very on par and sometimes better than GPT3.5 for the task I use it for. You’ll get a lot of different answers here because not many people run 70B at FP16.


What’s a useful amount of tokens/second?

What’s a speed like ChatGPT 4?

I’m trying to understand if there are agreed upon metrics such as “frames per second” in other fields or niches


>What’s a useful amount of tokens/second?

Depends entirely on your application. If you want a real-time chatbot, anything more than 10 tokens/s is probably generating text faster than the user can read, so it's fine. Do you want code suggestions or completions? Probably also fine, but a bit slow. Do you want to create summaries for 1000 documents? That's gonna be really slow. But at this frontier of the field, tokens/s is not really the issue; performance vs. quality is. If you lose a whole lot of accuracy by quantizing floats down to single-digit bit widths but in turn are able to run 70B-parameter models, you often still get better results than less thoroughly quantized 7B-parameter models. It's all about getting big models to run on memory-limited GPUs.


More than ~10 tokens/s streamed feels OK to read. I can kinda live with 4-5 tokens/s (which is what I get on llama 70B on my 3090 with llama.cpp CPU offload).

What feels "good" depends on the person and the type of content, but 35 tokens/s should feel very fast.


I wonder if anyone will end up employing the "human" solution of padding sentences out with filler words like 'Ummmm' or 'So like' while the generator is determining the actual next token...


Once the model has finished reading the prompt and starts answering, tokens are generated at a constant speed; there are no Ummm... gaps. But some tokens are already decided by the first layers, while other tokens need the full depth, so there is a notion of "thinking harder" on particular tokens. We could detect token difficulty and fake the output to show the pauses.


I've been learning a lot by watching the subreddit "Local LLaMA". Here's their discussion of the release: https://www.reddit.com/r/LocalLLaMA/comments/16gq2gu/exllama...


Unfortunate that Reddit is becoming a closed community with their recent API changes. I hope things move to Lemmy.


This subreddit remained open. Unfortunately, however, the oobabooga one went closed for a while and lost a lot of momentum. It is also back, however.

Are there good lemmy spaces for LLMs?


Community is a chicken-and-egg problem. I started my own Lemmy server to try to contribute and hold some space for discussion, but if you just want to start a community, it's as easy as making a new account on https://lemmy.world and clicking Create a Community on the front page.


For localllama this is the most active one so far (which isn't to say it is very active at all):

https://sh.itjust.works/c/localllama

If you post topics you'll generally get at least a few responses a day though.


Is the LLM local hosting community big enough for Nvidia to release something equivalent to a 4070 but with 40-80 GB of VRAM? A small group of Microsoft Flight Simulator enthusiasts might also like such a GPU.


It's more that the LLM non-local hosting (and training) community (on A100/H100) is big enough that it's more profitable for Nvidia to not release such prosumer GPUs ;)


I'm having the hardest time finding it, but I read a forum post about a few people who were able to add VRAM to existing 40-series cards after they finally cracked BIOS signing.

We're creeping up on the time when more consumer cards will have increases in VRAM (looks like the 40-series Supers will get a bump), and within the next couple of years I wouldn't be surprised to see mid-priced cards with >= 24GB.

Exciting times for sure.


Could someone explain how token generation speed relates to latency for the first token to be outputted?

And if anyone have any metrics on latency on a 4090 for the 70B model, that would be very helpful.


4090Ti? AFAIK that has never been released


typo


Unrelated. What matters for that is prompt processing time (which is in the high hundreds of tokens per second).


I know it's not true but part of me believes that meta intentionally held back the 34B to make the community work on improving 13B and 70B model usage.


That's fast. It's exciting to see more ways to run these models locally. How does this compare to llama.cpp – both in speed and approach?


What hardware and software would be recommended for "good quality" local inference with an LLM on:

- a Windows or Linux machine
- a Mac?


I think it depends a lot on what you're wanting to do. If you're just playing, you can get surprisingly good results from a 13B model and run it on a fairly basic M1 Mac. And on an M2 Ultra with a lot of RAM you can run some seriously large models at a good speed.

I compromised. Was going to build a PC with a nice graphics card just for LLM work, but in the end decided I didn't want the extra hardware and I could be happy with a powerful laptop that is my daily driver but also capable of running a good size LLM. Ended up with an M2 Max w/96GB of RAM. I can run 70B models (quantized at 4 bits) at a usable pace. Not 35 tokens/sec like this 4090 demo, but 5 tok/sec or so, which is usable for me.


On Windows and Linux, you can run 7B models on pretty much any CPU with AVX support. Most x86 CPUs that support SSE can get usable acceleration on Llama.cpp.


IIUC this should work on the RTX 3090 as well (probably at less than 35tps)? Since the minimum requirement seems to be 24GB of VRAM


In the table linked it says for "V2: 3090Ti" 30 t/s


What about the 4070 which 'only' has 12GB of VRAM?


Thats a good fit for 13B at the moment. Or older 30B models with CPU offloading.

I'm not certain 30B models will fit completely on 12GB, even with this quantization.


Maybe a stupid question, I'm not familiar with how video cards work, but does the concept of memory swapping exist for them? Or is that physically impossible?

Obviously that would be useless for games, but for LLMs it may be an option?


If you need to use all the memory then swapping breaks down, best you can do is try to divide the work so half can be done, then you completely swap all the memory and do the other half. Also PCIe is (relatively) slow for moving gigabytes of data back and forth.


Some frameworks have tried to do this automatically, but it's extremely slow. You are better off just running whatever won't fit on the CPU (which is what llama.cpp does).


Swapping would be too slow.

Llama.cpp allows you to specify that n layers load into the GPU's VRAM, and the rest load into main memory.


I get about the same speed (35 tok/s) out of 13B Llama2 on a 4070 Ti, FWIW.


How do you count tokens per second?


You measure how many tokens are output in a period of time and take the average?


Oh my bad, I meant how to code it?


Python pseudocode:

```py
import time

output = prompt("My favourite animal is ")  # assumes prompt() returns a token generator

start = time.time()
tokens = 0
for token in output:
    tokens += 1
    print(token)

print(f"Outputted {tokens / (time.time() - start):.1f} tokens/second")
```


Upvoted



