Exllamav2: Inference library for running LLMs locally on consumer-class GPUs (github.com/turboderp)
322 points by Palmik on Sept 13, 2023 | 125 comments



Why this is interesting: To my knowledge, this is the first time you are able to run the largest Llama, at competitive speed, on a consumer GPU (or something like A40).

On many tasks, fine-tuned Llama can outperform GPT-3.5-turbo or even GPT-4, but with a naive approach to serving (like HuggingFace + FastAPI), you will have a hard time beating the cost of GPT-3.5-turbo even with the smaller Llama models, especially if your utilization is low.


Running 70B on 24GB has been possible for months (so basically forever); it just got something like a 2x speedup now. The issue that it's a lobotomized ~2-bit quant remains.


Quantisation tradeoff is an interesting question. My experience so far is that larger models with lower-bit quantisation seem to perform better than smaller models with higher-bit quantisation. So, e.g., 30B-4bit appears to perform better than 13B-8bit. But then, an 8-bit quantisation seems not really distinguishable from the non-quantised model, so what benefit would you gain from not quantising the model?

Of course, speed, accuracy, power consumption, etc are all considerations, sometimes at odds with each other.


It's diminishing returns in terms of perplexity below 4 bits, and this is true even with newer quantisation methods like AWQ and OmniQuant.


I think OmniQuant is notable because it shifts the bend of the curve to 3-bit. Perplexity still ramps up below 3-bit, but it's notable in that the results stay usable and don't go asymptotic: https://github.com/OpenGVLab/OmniQuant/blob/main/imgs/weight...

What EXL2 seems to bring to the table is that you can target an arbitrary bits-per-weight average (e.g., if you're a bit short on VRAM, you don't need to go from 4->3 or 3->2, but can specify, say, 3.75 bpw). You have some control with other schemes by setting group size, or with k-quants, but EXL2 definitely allows you to be finer-grained. I haven't gotten a chance to sit down with EXL2 yet, but if no one else does it, it's on my todo list to run 1:1 perplexity and standard benchmark evals on all the various new quantization methods, just as a matter of curiosity.
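
For a rough sense of why a fractional bpw target matters, here's a back-of-the-envelope sketch (my own numbers, weights-only; the KV cache and activations add a few GB on top), assuming a 70B-parameter model:

```py
def weight_gb(n_params, bpw):
    # bits per weight -> gigabytes for the weights alone
    return n_params * bpw / 8 / 1e9

for bpw in (4.0, 3.75, 3.0, 2.5):
    print(f"{bpw} bpw -> {weight_gb(70e9, bpw):.1f} GB")
# 4.0 -> 35.0, 3.75 -> 32.8, 3.0 -> 26.2, 2.5 -> 21.9 GB
```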


Can you do tuning/training on quantized models? Or are they inference only?


This paper[1] attempts to do LoRA fine-tuning on quantised LLMs. The LoRA weights themselves are full precision, but they're much smaller. I haven't had a chance to try it yet, though.

[1]: https://arxiv.org/abs/2306.08162
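
For anyone who wants to poke at the general idea, here's a minimal QLoRA-style sketch using HuggingFace PEFT + bitsandbytes (not necessarily the exact method from the linked paper; the model name is just a placeholder): the base model is loaded in 4-bit and only the small full-precision LoRA adapters get trained.

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder model
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # adapters are a tiny fraction of the 7B
```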


Yup, but the thing is: can you get the model working how you want, and can you run it in production? My experience: GPT-4, yes; GPT-3.5, yes but needs some work; Llama2-70B at 8-bit or more, yes but needs some work; below that it really depends on the use case.


Something like that 4^30 >>> 8^13?


That's absolutely false, unless you mean running llamacpp with half the layers on the CPU and half the layers on the GPU. And that was like 2t/s.


> running llamacpp with half the layers on the CPU and half the layers on the GPU

How do you do that? Could you please provide a link if you have it handy.


When running, you use "-ngl" to specify how many layers are offloaded to the GPU.


Just set the --gpu-layers|-ngl flag. Run main with --help to see all the options.
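
If you'd rather drive it from Python than the CLI, the llama-cpp-python binding exposes the same knob as n_gpu_layers; a minimal sketch (model path and layer count are placeholders):

```py
from llama_cpp import Llama

# n_gpu_layers mirrors the CLI's -ngl flag: how many transformer layers to
# offload to VRAM; the rest stay in system RAM and run on the CPU.
llm = Llama(model_path="./llama-2-70b.Q2_K.gguf",   # placeholder path
            n_gpu_layers=40, n_ctx=4096)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```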


Saying fine-tuned LLaMA can outperform GPT-3.5-turbo on some tasks would be reasonably generous.

Saying it can outperform GPT-3.5-turbo on "many tasks" would be venturing into unreasonable territory.

Saying it can outperform GPT-3.5-turbo or even GPT-4 on many tasks is just setting people up for disappointment: there's a reason ARC (common-sense reasoning) never gets mentioned when touting OS LLMs vs. commercial ones, and the gap in ARC performance is amplified terribly when you try to apply chain-of-thought.


I am trying to find support for your last argument.

https://paperswithcode.com/sota/common-sense-reasoning-on-ar...

  GPT-3       53.2
  GPT-3.5     85.2
  LLaMa-65B   56.0
Any idea of the performance of an instruction-fine-tuned version of LLaMa models? I can't seem to find non-aggregated performance figures on ARC.


I'm not sure if these ARC scores are fully comparable to the ones above, but if they are: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

TigerResearch/tigerbot-70b-chat: ARC (76.79)

Still below 3.5 and like most top entries no clear training objectives, which usually means they were fine-tuned on the benchmark itself.

-

ARC is easily the most important benchmark right now for widespread adoption of LLMs by laypeople: it's literally grade-school multiple choice, written to the level of an 8th grader.

When you sit people down in front of an LLM and have them interact with it, ARC has by far the closest correlation with how "generally smart" the model will feel.


Wow, the top 14 are all proprietary models (except 2 models which are not commonly used in OS community anyway).

And surprisingly, Llama 33B performs _better_ than Llama 65B!


That table is comparing few-shot-prompted GPT-4 to zero-shot LLaMA.


I am totally with you when it comes to reasoning, or when it comes to generally / universally capable models. But many problems don't require reasoning, and are very concrete / specific.


I use my models primarily for code generation and evaluation. Does the latter part fall into “reasoning” territory?


Advanced code generation, the kind you would need for something like AutoGPT or other agent-like system, is definitely reasoning and GPT-4 or Claude 2 will dominate. But most of the code generation from GitHub Copilot or other systems (like Replit) is much simpler pattern matching (and indeed, Replit afaik uses very small models).

What do you mean by evaluation?


My personal use case is most often supplying files and generating unit tests, as well as supplying a section of code with a prompt about that section to evaluate the approach contextually with some other unseen code. Which perhaps is another form of reasoning.


I doubt a 2.55bit quantised 70bn model will beat gpt3.5 on a meaningful percentage of tasks


Could you explain what you mean by naive approach to serving?


Sure, there are many ways one can be accidentally inefficient, some examples:

Batching: Your throughput (and thus cost per token) will be much worse if you don't do batching (meaning running inference for several inputs at once); see the sketch after these points. To do batching, you either need high sustained QPS or you adjust your workload to be bursty (but this has utilization issues).

Inference engine: The regular HuggingFace stack, even with built-in optimizations, is not competitive against inference engines like vLLM, exllama, etc.

Utilization: Depending on your scale, it might be hard for you to have a nice flat utilization, or to even utilize one GPU at 100% capacity. This means you need to solve scaling up & down your machines and the issues associated with it (can you live with cold starts? etc.)

Hardware: Forget serverless GPUs like Replicate etc.; their markups compared to on-demand pricing are usually >10x. On-demand A100s or H100s, provided you can even get a quota for them, are also expensive. Spot instances are better. Older GPUs (A10, T4, etc.) are better. Your own GPU cluster is likely the best if you have the scale (and likely, you can resell your cluster within the next 18 months without much if any depreciation).
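
To make the batching point above concrete, here's a minimal sketch using vLLM's offline API (model name and sampling settings are placeholders); handing the engine a list of prompts lets it batch them on the GPU instead of serving them one at a time:

```py
from vllm import LLM, SamplingParams

# One engine, many prompts: the scheduler batches them on the GPU, which is
# where most of the throughput (and cost-per-token) win comes from.
llm = LLM(model="meta-llama/Llama-2-13b-hf")     # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Summarize ticket #{i}: ..." for i in range(64)]
outputs = llm.generate(prompts, params)          # batched, not 64 serial calls

for out in outputs:
    print(out.outputs[0].text[:80])
```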

For these reasons, I have been toying with the idea of providing a dead simple service for fine-tuning & serving open-source LLMs where the users actually own (and can download) the weights. If anyone is interested in this and would like to chat, let me know.


Interested in this. Are you thinking of it being a SaaS type thing? I have some rudimentary experience, i.e. training NanoGPT on both Modal.com and Vast.ai (which is vastly cheaper, but privacy/cost trade off there).

It could be a layer on top of $gpu_cloud (so end user signs up for the cloud) and the service makes use of those resources.


Do you run inference on NanoGPT with, say, a Flask app? I made a WebUI for tiny Llama, but it doesn't work for NanoChatGPT.


Ha ha not sure yet, I have only done it from the command line. Still got a lot to learn. Building something like this will be good for that!


Thanks for the writeup

Our special case is a single model, high throughput, and latency-insensitive.

Can we use multiple lower-memory GPUs, split the model "horizontally", and pipeline each batch/inference across the GPUs?

Is this a common scenario handled by the engines you mentioned?


We’d be interested in helping! aronchick (at) expanso (dot) io


But you can fine tune gpt 3.5 turbo, so your comparison is not clear.


Fine tuned gpt-3.5 is 8x the cost of regular gpt-3.5 for inference, other thread where this is discussed: https://news.ycombinator.com/item?id=37494335

This may be acceptable for some use cases, but not for others where even regular gpt-3.5 may not be economical.


Interesting bit at the end of the text:

2023-09-13: Preliminary ROCm support added

Makes me curious how the RTX 4090/3090 will compare with something 7900-ish.


While I haven't given ExLLamaV2 a try yet since my 7900XT is on a Win10 dedicated gaming machine, I recently sat down and figured out getting llama.cpp w/ ROCm running on it: https://llm-tracker.info/books/howto-guides/page/amd-gpus#bk...

Here's how it compares on standardized llama2-7b 4K context testing vs some Nvidia cards: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...

Not on the same platform so hard to make any solid conclusions, but vs the 4090 on a perf/$ basis right now, it's not too bad - about 60% of the inference performance of a 4090 at 45% the price. Still, if ML is your main use case, I think it's hard to argue that a used 3090 wouldn't be the way to go.


yeah, a used 3090 would be the most likely candidate if my next build was something for personal use only. Thing is, I'm probably going to expense it. Which limits me to new components.

And I'm not in too much of a rush (weeks / short months). Can afford to wait for a month / two to see how the ROCm support pans out on the AMD side. 60% performance for <50% price is a compelling argument. If the ML support is decent...

Though the 4090 is somewhat expensive, I might end up taking it anyway just for the simple reason that it's the only 24GB card with CUDA.


Makes sense to me, although if you're using it professionally as an individual, then I would also seriously consider a last-gen A6000 (about $4K) - despite advances in quants, I'd say 48GB is a better minimum bar atm especially if you want to use long context, run multiple model types, do local fine tuning, etc.


Is it not possible to do finetuning on a 24GB 4090? I'm thinking about the best bang for the buck below $2k and the options are 2x3090 or 1x4090.


> Is it not possible to do finetuning on a 24GB 4090?

Not for 70B. You can finetune 7B or 13B.


You can't FT 70b with 48gb of vram (2x 3090) either.


yeah, that's the thing. The AI/ML is something I'd like to play with / learn / experiment on, but it's only tangential to my current paying job at this point in time (I'm basically a freelancer).

"Expense" in my context means that I'll deduce about 50% off of the invoice come taxes time. I'm OK paying $400 of my own money for an AMD card, I'd think twice, but probably accept coughing up $800 for a 4090, but $2k is definitely over the budget at this stage.

Thanks for sharing the spreadsheet with your benchmarks though. You're going into my "HN commenters to watch" list ;-)


I'm not sure where you're getting your pricing from but in the US a new 4090 is at least $1600, a 7900 XTX is about $1000 (a 20GB 7900 XT is $750-800). Used 3090s are usually about $700.

If you're just dipping your toe in I'd recommend using Google Colab or cloud rentals like Runpod/Vast.ai which will be more cost effective if you're not running 24/7 or want to have access immediately on tap.


Ah, sorry, I was talking about the money I would pay after the expense deduction of 50%. So, 4090 at $1600 would cost me about $800 of my own money in the end. That being said, I'm writing from the EU, so all these numbers are really just ballparks where I take 1EUR==1USD for the sake of simplicity.

Thanks for the rental tips though. I'll take a look. It's indeed more effective in the beginning...


I wonder what are the practical concerns of loading up an AMD laptop (or I guess one of those mini PCs) with an APU/integrated GPU and allocating 70+GB of ram to them as VRAM.

Wouldn't that, in theory, be what consumer LLM users want? A mid to low range GPU with tons of VRAM?

Or am I missing something here with regards to memory bandwidth? (sort of started this discussion a while back on the localllama lemmy: https://lemmy.ca/post/3910263?scrollToComments=true )


Interesting how this method quantizes different layers / modules in a manner that minimizes perplexity as it adjusts parameters.

I'd be interested to see how 2.5-bit quantization compares to an unadjusted 4-bit baseline. Additionally, it would be interesting to see the usual benchmarks (ARC, HellaSwag, MMLU, TruthfulQA) for this method at different average bitrates (2.0, 2.5, 3.0, 4.0, ...). It would also be interesting to see if an average bitrate of 4 is just as fast and small as a constant bitrate of 4, but more accurate.

Very exciting work, looking forward to trying this out on my models!


Weird question: has anybody tried what happens if you reduce the models to single bit width?

Do they produce complete gibberish or do they still work?


There is some work on related areas (pre LLM) eg https://cmu-odml.github.io/papers/XOR-Net_An_Efficient_Compu...


Wow, thank you for digging that up, I suspected there might be some gain there but from the abstract it looks like I was off by a massive amount in my naive estimate of what might be achievable.

"Theoretical analysis shows that XOR-Net reduces one-third of the bit-wise operations compared with traditional binary convolution, and up to 40% of the full- precision operations compared with XNOR-Net. Experimental results show that our XOR-Net binary convolution without scaling factors achieves up to 135× speedup and consumes no more than 0.8% energy compared with parallel full-precision convolution."

The big question is whether or not the results are still as good. It would be super interesting to see whether applying this to LLMs would give comparable benefits.


Pardon me for asking a bunch of consumer-gpu questions:

1) is there any architectural difference between the 3090 and the 4090? Are there models that will run on 3 series that will not run on 4 series cards due to feature incompatibility? or is it just raw speed/power efficiency?

2) And; on raw speed. My understanding is that the 4090 is roughly 2x faster than the 3090, right? Are there any things in particular where the 4090 is much much faster (such as, big chip architecture changes that really accelerate some operations)?

3) And somewhat relatedly, if you have a 2x 3090 rig, is that about as fast as a single 4090?

4) And; if you have a 2x 3090 rig, are you able to train larger models? My understanding is that training/finetuning requires significantly more VRAM (and that is because inference can be done with quantization, whereas training must be done at full fp16 precision, right?) But can you train e.g. a model that requires 48GB of space with 2 3090s on the same rig, or do you need a large card with 48GB, like an A6000?


> Are there models that will run on 3 series that will not run on 4 series cards due to feature incompatibility?

No, it is all the same matrix multiplication.

> if you have a 2x 3090 rig, is that about as fast as a single 4090?

Yes, but double the VRAM

> But can you train e.g. a model that requires 48GB of space with 2 3090s on the same rig, or do you need a large card with 48GB, like an A6000?

Yes, 2x 3090 would be able to handle anything that a 48GB card can. For training you will need space for weights and gradients. If using the Adam optimizer, that would be 2x-4x the model size. Plus weights, plus activations, plus inputs times batch size. So a 24GB card can train approximately a 3B model without compromises.
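
A rough calculator following that accounting (assumes fp16 weights and gradients, Adam states at roughly 2x the model size; activations and batch size come on top):

```py
def training_vram_gb(n_params, bytes_per_weight=2, optimizer_factor=2):
    weights = n_params * bytes_per_weight     # fp16 weights
    grads = n_params * bytes_per_weight       # fp16 gradients
    optimizer = weights * optimizer_factor    # Adam moments, ~2x-4x the model size
    return (weights + grads + optimizer) / 1e9

print(training_vram_gb(3e9))   # ~24.0 GB before activations, so ~3B is about
                               # the ceiling for a 24GB card
```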

I am not a lawyer.


I am not your data scientist


This is fantastic. It's really nice to see the different speed comparisons. Does anyone know how a large model performs on some serious hardware? How much faster are these top-grade Nvidia cards than a 4090? Do models that require multiple cards to run in memory run with equivalent performance of the multiple cards or are they throttled by the memory access?


> Do models that require multiple cards to run in memory run with equivalent performance of the multiple cards or are they throttled by the memory access?

It depends, but some frameworks just run the layers sequentially. So you get ~50% utilization from 2 GPUs unless you pipeline requests from multiple users


Amazing.

But at what cost? Have there been any perplexity benchmarks for EXL2?

It's annoying that Facebook made Llama v2 70B instead of 65B, even with the memory-saving changes... 5B fewer parameters, and the squeeze would be far easier.


Good point -- hopefully the quality impact is still worth it, remains to be seen. Agree on the size -- hopefully something they will keep in mind for future models.


If it's better than the equivalent 30B model, that's still a huge achievement.

Llama.cpp's Q2_K quant is 2.5625 bpw with perplexity just barely better than the next step down: https://github.com/ggerganov/llama.cpp/pull/1684

But subjectively, the Q2 quant "feels" worse than its high wikitext perplexity would suggest.

That's apples to oranges, as this quantization is different than Q2_K, but I just hope the quality hit in practice isn't so bad.


This is quantized down to 2.5 bits per weight, whereas single precision accuracy is 32 bits.


These models aren't even trained on FP32. The original format is FP16. And quantizing to INT8/FP8 is almost lossless.

But yes, 2.5 bits per weight is pretty insane.


This should be in the headline (the 2.5 bit part), without the qualifier the result means nothing.


How is 2.5 bits possible? Is it an average between 2 and 3 over all weights?


> Is it an average between 2 and 3 over all weights?

Yes, I think it's an average where different quantization levels are used for different layers or weights. Here are more details about the quantization scheme: https://github.com/turboderp/exllamav2#exl2-quantization
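
A toy illustration of how a fractional average like 2.5 bpw can come about (made-up layer sizes and bit choices, not the actual EXL2 allocation):

```py
layer_params = [8e8] * 10                        # pretend: ten equally sized layer groups
layer_bits   = [4, 3, 3, 2, 2, 2, 2, 2, 2, 3]    # per-layer precision choices

total_bits = sum(p * b for p, b in zip(layer_params, layer_bits))
print(total_bits / sum(layer_params))            # 2.5 bits per weight on average
print(total_bits / 8 / 1e9, "GB")                # 2.5 GB for these 8B pretend params
```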


Any benchmarks on performance?


Can anyone provide any additional details on the EXL2[0]/GPTQ[1] quantisation, which seems to be the main reason for a speedup in this model?

I had a quick look at the paper which is _reasonably_ clear, but if anyone else has any other sources that are easy to understand, or a quick explanation to give more insight into it, I'd appreciate it.

[0] https://github.com/turboderp/exllamav2#exl2-quantization

[1] https://arxiv.org/abs/2210.17323


Ok, this article seems to have a nicer, more approachable explanation[0]

[0] https://towardsdatascience.com/4-bit-quantization-with-gptq-...


> which seems to be the main reason for a speedup in this model

I wouldn't agree with this, as there are substantial perf improvements of ca. 20% on GPTQ models as well.


What is the best-performing CodeLlama model on a MacBook with an M1 chip?

Will a GGUF 7B 4-bit quantized model be good to run locally on the Mac?


I would use Ollama. The answer depends on your RAM size. 13B at 4-bit just fits on a 16GB machine. 34B at 4-bit will require 32GB.


What is the best model that would run on a 64GB M2 Max? I'm interested in attempting a 70B model, but not gimped to 2-bit if I can avoid it. Would I be able to run 4-bit 70B?


The biggest issue with local LLMs is that "best" is extremely relative. As in, best for what application? If you want a generic LLM that does OK at everything, you can try Vicuna. If you want coding, Code Llama 34B is really good. Surprisingly, the 13B version isn't as good, but it's pretty darn close to the 34B.


I have 8GB RAM. So I am able to run 7B 4bit at maximum.

I used GGUF + llama.cpp.

Will try out Ollama.

Thanks for the info!


Does each parameter need double the allocation size?


Not really, a 13B at 4-bit fits in about 8GB of RAM. But you have to factor in the unified memory aspect of M-series chips: your "VRAM" and RAM come from the same pool. Your computer shares that pool to run the OS and applications, which is why you can only load a 13B 4-bit model on 16GB of RAM.


How does "70B Llama 2" compare with ChatGPT 3.5/4.0 in reality? Are they on par? In all areas?


Some 70B finetunes excel at certain niches, like specific non-English languages, roleplay, fictional writing, or topics GPT4 would refuse to discuss. Some of these are hard to evaluate, but try out (for instance) MythoMax for fiction writing, Airoboros for "uncensored" general use, and Samantha for therapist style chat.

https://huggingface.co/models?sort=modified&search=70b


I've been away from LLMs for a while. Is there something like CivitAI for LLMs to find models for certain niches?


That is it ^. Search for the parameter count you want (like "70B" or "13B") and the format you desire ("GPTQ" or "GGUF")

Some users are really terrible about labeling their model cards though, and some models may not have any GPTQ/GGUF files (meaning you have to convert them yourself).


It seems to depend on the task. I'd say it beats GPT-3 in quality most of the time (but not in speed and cost!) and GPT-4 approximately never. It's perfect if what you need is "less good than GPT-4 but 50x cheaper."


The biggest benefit of llama is that you can get it uncensored so you don't get the weasely lawyer disclaimer crap all the time.


In my experience the Chat version of Llama 2 is significantly worse than the GPTs, on the "I'm afraid I can't do that" front. It doesn't lecture you as much, but suggests "how about we do [something unrelated and useless] instead?"

As for the plain text prediction version (which is the only uncensored one?), I haven't been able to get it to do anything useful, even when I provide examples (seems dumber than even ancient GPT-3?).

Also, I got some bizarre and disturbing outputs from the uncensored version, like it was trained on some very nasty inputs! I assume that's why they went so hard on the safety phase to compensate...


Where fine tuning shines imo:

Things that require consistency: e.g. you want the chat / output to have certain "personality", consistent level of conciseness or formatting.

Things where examples are hard to fit into prompt: e.g. summarization, or other longer form tasks.

High volume, simpler tasks: Various data extraction tasks.

Two of my side projects (links in bio) use AI for summarization, and indeed consistency is a big issue there.


You can also fine-tune gpt-3.5 though.


Fine tuned gpt-3.5 is ~8x the price compared to regular gpt-3.5. It might be a good way to collect better training data though. :)


That's absolute dirt cheap. It's like 1/10th the price fine-tuned GPT-3 was launched at, and funnily enough cheaper than many of the original embeddings endpoints.


It's great that the costs are coming down, but it's still not economical for many activities.

The fact that some things had extreme margins before, and now they have less extreme margins, isn't a really good indicator.


They made a technological advancement that allowed them to provide better output cheaper and faster and passed the savings onto their users...

Interesting to spin that into "not a good indicator".


Yes, and that's great, but what I meant is that it can definitely be much cheaper still, especially if you do not need a model that needs to be able to do everything, but is specialized on one or a few use cases.


At least it's not completely censored, that alone is nice


And it can be gotten more uncensored on HuggingFace.

Not that I have evil intentions but the level of censorship on GPT is completely ridiculous. Even many innocent questions get the standard "I'm only an AI and I won't help you doing bad stuff" blurb now. OpenAI are really crazy overprotective of their darling.

I assume they want to avoid a repeat of the news headlines like "Microsoft's chatbot turns into Hitler" but really who cares. It didn't hurt Microsoft's AI efforts either. They just fixed it and continued. PS source link: https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot...

PS: If I were an AI being force-fed what is currently on twitter I would also start hating humanz :D Sometimes I'm surprised people use it voluntarily.


I once asked it for naming suggestions which referenced the concept of jailing a process for a concept I was working on

It refused because the request was "offensive to prisoners"

They must be paying a heavy alignment tax.


I don’t have quantitative data but my employer recently released an internal llama v2 at 70B FP16. The internal front end allows us to switch between different LLMs. I would say it’s very on par and sometimes better than GPT3.5 for the task I use it for. You’ll get a lot of different answers here because not many people run 70B at FP16.


What’s a useful amount of tokens/second?

What’s a speed like ChatGPT 4?

I’m trying to understand if there are agreed upon metrics such as “frames per second” in other fields or niches


>What’s a useful amount of tokens/second?

Depends entirely on your application. If you want a real-time chatbot, anything more than 10 tokens/s is probably generating text faster than the user can read, so it's fine. Do you want code suggestions or completions? Probably also fine, but a bit slow. Do you want to create summaries for 1000 documents? That's gonna be really slow. But at this frontier of the field, tokens/s is not really the issue; performance vs. quality is. If you lose a whole lot of accuracy by quantizing floats down to single-digit bit widths but in turn are able to run 70B-parameter models, you often still get better results than less thoroughly quantized 7B-parameter models. It's all about getting big models to run on memory-limited GPUs.


More than ~10 tokens/s streamed feels OK to read. I can kinda live with 4-5 tokens/s (which is what I get on llama 70B on my 3090 with llama.cpp CPU offload).

What feels "good" depends on the person and the type of content, but 35 tokens/s should feel very fast.


I wonder if anyone will end up employing the "human" solution of padding sentences out with filler words like 'Ummmm' or 'So like' while the generator is determining the actual next token...


Once the model has finished reading the prompt and starts answering, tokens are generated at a constant speed; there are no Ummm... gaps. But some tokens are already decided by the first layers, while other tokens need the full depth, so there is a notion of "thinking harder" on particular tokens. We could detect token difficulty and fake the output to show the pauses.


I've been learning a lot by watching the subreddit "Local LLaMA". Here's their discussion of the release: https://www.reddit.com/r/LocalLLaMA/comments/16gq2gu/exllama...


Unfortunate that Reddit is becoming a closed community with their recent API changes. I hope things move to Lemmy.


This subreddit remained open. Unfortunately, however, the oobabooga one went closed for a while and lost a lot of momentum. It is also back, however.

Are there good lemmy spaces for LLMs?


Community is a chicken-and-egg problem. I started my own Lemmy server to try to contribute and hold some space for discussion, but if you just want to start a community, it's as easy as making a new account on https://lemmy.world and clicking Create a Community on the front page.


For localllama this is the most active one so far (which isn't to say it is very active at all):

https://sh.itjust.works/c/localllama

If you post topics you'll generally get at least a few responses a day though.


Is the LLM local hosting community big enough for Nvidia to release something equivalent to a 4070 but with 40-80 GB of VRAM? A small group of Microsoft Flight Simulator enthusiasts might also like such a GPU.


It's more that the LLM non-local hosting (and training) community (on A100/H100) is big enough that it's more profitable for Nvidia to not release such prosumer GPUs ;)


I'm having the hardest time finding it, but I read a forum post about a few people who were able to add VRAM to existing 40-series cards after they finally cracked BIOS signing.

We're creeping up on the time when more consumer cards will have increases in VRAM (looks like the 40-series Supers will get a bump), and within the next couple of years I wouldn't be surprised to see mid-priced cards with >= 24GB.

Exciting times for sure.


Could someone explain how token generation speed relates to latency for the first token to be outputted?

And if anyone have any metrics on latency on a 4090 for the 70B model, that would be very helpful.


4090Ti? AFAIK that has never been released


typo


Unrelated. What matters for that is prompt processing time (which is in the high hundreds of tokens per second).


I know it's not true but part of me believes that meta intentionally held back the 34B to make the community work on improving 13B and 70B model usage.


That's fast. It's exciting to see more ways to run these models locally. How does this compare to llama.cpp – both in speed and approach?


What hardware and software would be recommended for "good quality" local inference with an LLM on:

- a Windows or Linux machine
- a Mac?


I think it depends a lot on what you're wanting to do. If you're just playing, you can get surprisingly good results from a 13B model and run it on a fairly basic M1 Mac. And on an M2 Ultra with a lot of RAM you can run some seriously large models at a good speed.

I compromised. Was going to build a PC with a nice graphics card just for LLM work, but in the end decided I didn't want the extra hardware and I could be happy with a powerful laptop that is my daily driver but also capable of running a good size LLM. Ended up with an M2 Max w/96GB of RAM. I can run 70B models (quantized at 4 bits) at a usable pace. Not 35 tokens/sec like this 4090 demo, but 5 tok/sec or so, which is usable for me.


On Windows and Linux, you can run 7B models on pretty much any CPU with AVX support. Most x86 CPUs that support SSE can get usable acceleration on Llama.cpp.


IIUC this should work on the RTX 3090 as well (probably at less than 35tps)? Since the minimum requirement seems to be 24GB of VRAM


In the table linked it says for "V2: 3090Ti" 30 t/s


What about the 4070 which 'only' has 12GB of VRAM?


Thats a good fit for 13B at the moment. Or older 30B models with CPU offloading.

I'm not certain 30B models will fit completely on 12GB, even with this quantization.


Maybe a stupid question, I'm not familiar with how video cards work, but does the concept of memory swapping exist for them? Or is that physically impossible?

Obviously that would be useless for games, but for LLMs it may be an option?


If you need to use all the memory then swapping breaks down, best you can do is try to divide the work so half can be done, then you completely swap all the memory and do the other half. Also PCIe is (relatively) slow for moving gigabytes of data back and forth.


Some frameworks have tried to do this automatically, but it's extremely slow. You are better off just running whatever won't fit on the CPU (which is what llama.cpp does).


Swapping would be too slow.

Llama.cpp allows you to specify that n layers load into the GPU's VRAM, and the rest load into main memory.


I get about the same speed (35 tok/s) out of 13B Llama2 on a 4070 Ti, FWIW.


How do you count tokens per second?


You measure how many tokens are output in a period of time and take the average?


Oh my bad, I meant how to code it?


Python pseudocode:

```py
import time

output = prompt("My favourite animal is ")  # assumes prompt() returns a token generator

start = time.time()
tokens = 0
for token in output:
    tokens += 1
    print(token)

print(f"Outputted {tokens / (time.time() - start):.1f} tokens/second")
```


Upvoted



