Why this is interesting: To my knowledge, this is the first time you are able to run the largest Llama, at competitive speed, on a consumer GPU (or something like A40).
On many tasks, fine-tuned Llama can outperform GPT-3.5-turbo or even GPT-4, but with a naive approach to serving (like HuggingFace + FastAPI), you will have a hard time beating the cost of GPT-3.5-turbo even with the smaller Llama models, especially if your utilization is low.
running 70B on 24GB has been possible for months (so basically forever), it just got like a 2x speedup now. the issue remains that it's a lobotomized ~2-bit quant.
The quantisation tradeoff is an interesting question. My experience so far is that larger models with lower-bit quantisation seem to perform better than smaller models with higher-bit quantisation. So, e.g., 30B-4bit appears to perform better than 13B-8bit. But then, an 8-bit quantisation seems not really distinguishable from the non-quantised model, so what benefit might you gain from not quantising the model at all?
Of course, speed, accuracy, power consumption, etc are all considerations, sometimes at odds with each other.
What EXL2 seems to bring to the table is that you can target an arbitrary quantization bit-width (e.g., if you're a bit short on VRAM, you don't need to go from 4->3 or 3->2, but can specify, say, 3.75 bpw). You have some control with other schemes by setting group size, or with k-quants, but EXL2 definitely allows you to be finer grained. I haven't gotten a chance to sit down with EXL2 yet, but if no one else does it, it's on my todo list to do 1:1 perplexity and standard benchmark evals on all the various new quantization methods, just as a matter of curiosity.
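To make the fractional-bpw idea concrete, here's a toy sketch of how an average like 3.75 bpw can arise from mixing bit widths across layers. The layer split and fractions below are made up for illustration; the actual EXL2 allocation is chosen by measuring quantization error per module.

```python
# Toy illustration of a mixed-precision "average bpw" target. The split
# below is hypothetical; real schemes pick per-layer widths empirically.
layers = {
    "attention": (4.0, 0.5),  # (bits per weight, fraction of total weights)
    "mlp":       (3.5, 0.5),
}

# The advertised bitrate is just the weighted mean over all weights.
avg_bpw = sum(bits * frac for bits, frac in layers.values())
print(avg_bpw)  # 3.75
```

So "3.75 bpw" doesn't mean any individual weight is stored in 3.75 bits; it's the average over a mix of widths.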
This paper[1] attempts to do LoRA fine-tuning on quantised LLMs. The LoRA weights themselves are full precision, but they're much smaller. I haven't had a chance to try it yet, though.
Yup, but the thing is: can you get the model working how you want, and can you run it in production? My experience: GPT-4, yes; GPT-3.5, yes but needs some work; Llama2-70b at 8-bit or more, yes but needs some work; below that it really depends on the use case.
Saying fine-tuned LLaMA can outperform GPT-3.5-turbo on some tasks would be reasonably generous.
Saying it can outperform GPT-3.5-turbo on "many tasks" would be venturing into unreasonable territory.
Saying it can outperform GPT-3.5-turbo or even GPT-4 on many tasks is just setting people up for disappointment: there's a reason ARC (common sense reasoning) never gets mentioned when touting open-source LLMs vs commercial ones, and the gap in ARC performance is amplified terribly when you try to apply chain-of-thought.
Still below 3.5 and like most top entries no clear training objectives, which usually means they were fine-tuned on the benchmark itself.
ARC is easily the most important benchmark right now for widespread adoption of LLMs by laypeople: it's literal grade school multiple choice, created to the bar of an 8th grader.
When you sit people down in front of an LLM and have them interact with it, ARC has by far the closest correlation with how "generally smart" the model will feel.
I am totally with you when it comes to reasoning, or when it comes to generally / universally capable models. But many problems don't require reasoning, and are very concrete / specific.
Advanced code generation, the kind you would need for something like AutoGPT or other agent-like system, is definitely reasoning and GPT-4 or Claude 2 will dominate. But most of the code generation from GitHub Copilot or other systems (like Replit) is much simpler pattern matching (and indeed, Replit afaik uses very small models).
My personal use case is most often supplying files and generating unit tests, as well as supplying a section of code with a prompt about that section to evaluate the approach contextually with some other unseen code. Which perhaps is another form of reasoning.
Sure, there are many ways one can be accidentally inefficient, some examples:
Batching: Your throughput (and thus cost per token) will be much worse if you don't do batching (meaning running inference for several inputs at once). To do batching, you either need to have a high sustained QPS or adjust your workload to be bursty (but this has utilization issues).
Inference engine: The regular HuggingFace, even with built in optimizations, is not competitive against inference engines like vllm, exllama, etc.
Utilization: Depending on your scale, it might be hard for you to have a nice flat utilization, or to even utilize one GPU at 100% capacity. This means you need to solve scaling up & down your machines and the issues associated with it (can you live with cold starts? etc.)
Hardware: Forget serverless GPUs like Replicate etc.; their markups compared to on-demand pricing are usually >10x. On-demand A100 or H100, provided you can even get a quota for them, are also expensive. Spot instances are better. Older GPUs (A10, T4, etc.) are better. Your own GPU cluster is likely the best if you have the scale (and likely, you can resell your cluster within the next 18 months without much if any depreciation).
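The batching point above can be sketched with toy numbers (the 30 ms step time is a made-up assumption): single-stream decoding is memory-bandwidth bound, so one decode step costs roughly the same wall time for a small batch as for a batch of one, and aggregate throughput scales almost linearly with batch size.

```python
# Hypothetical: one decode step takes ~30 ms regardless of (small) batch
# size, because the step is dominated by reading the weights, not compute.
STEP_MS = 30.0

def tokens_per_second(batch_size, step_ms=STEP_MS):
    # Each decode step emits one token per sequence in the batch.
    return batch_size * 1000.0 / step_ms

for b in (1, 4, 8):
    print(b, round(tokens_per_second(b)))  # 1 -> 33, 4 -> 133, 8 -> 267
```

Per-request latency barely moves, but cost per token drops almost 8x at batch 8, which is why sustained QPS (or deliberately bursty workloads) matters so much for serving economics.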
For these reasons, I have been toying with the idea of providing a dead simple service for fine-tuning & serving open-source LLMs where the users actually own (and can download) the weights. If anyone is interested in this and would like to chat, let me know.
Interested in this. Are you thinking of it being a SaaS type thing? I have some rudimentary experience, i.e. training NanoGPT on both Modal.com and Vast.ai (which is vastly cheaper, but privacy/cost trade off there).
It could be a layer on top of $gpu_cloud (so end user signs up for the cloud) and the service makes use of those resources.
Not on the same platform so hard to make any solid conclusions, but vs the 4090 on a perf/$ basis right now, it's not too bad - about 60% of the inference performance of a 4090 at 45% the price. Still, if ML is your main use case, I think it's hard to argue that a used 3090 wouldn't be the way to go.
yeah, a used 3090 would be the most likely candidate if my next build was something for personal use only. Thing is, I'm probably going to expense it. Which limits me to new components.
And I'm not in too much of a rush (weeks / short months). Can afford to wait for a month / two to see how the ROCm support pans out on the AMD side. 60% performance for <50% price is a compelling argument. If the ML support is decent...
Though the 4090 is somewhat expensive, I might end up taking it anyway just for the simple reason that it's the only 24GB card with CUDA.
Makes sense to me, although if you're using it professionally as an individual, then I would also seriously consider a last-gen A6000 (about $4K) - despite advances in quants, I'd say 48GB is a better minimum bar atm especially if you want to use long context, run multiple model types, do local fine tuning, etc.
yeah, that's the thing. The AI/ML is something I'd like to play with / learn / experiment on, but it's only tangential to my current paying job at this point in time (I'm basically a freelancer).
"Expense" in my context means that I'll deduct about 50% off the invoice come tax time. I'm OK paying $400 of my own money for an AMD card; I'd think twice, but probably accept coughing up $800 for a 4090, but $2k is definitely over the budget at this stage.
Thanks for sharing the spreadsheet with your benchmarks though. You're going into my "HN commenters to watch" list ;-)
I'm not sure where you're getting your pricing from but in the US a new 4090 is at least $1600, a 7900 XTX is about $1000 (a 20GB 7900 XT is $750-800). Used 3090s are usually about $700.
If you're just dipping your toe in I'd recommend using Google Colab or cloud rentals like Runpod/Vast.ai which will be more cost effective if you're not running 24/7 or want to have access immediately on tap.
Ah, sorry, I was talking about the money I would pay after the expense deduction of 50%. So, 4090 at $1600 would cost me about $800 of my own money in the end. That being said, I'm writing from the EU, so all these numbers are really just ballparks where I take 1EUR==1USD for the sake of simplicity.
Thanks for the rental tips though. I'll take a look. It's indeed more effective in the beginning...
I wonder what are the practical concerns of loading up an AMD laptop (or I guess one of those mini PCs) with an APU/integrated GPU and allocating 70+GB of ram to them as VRAM.
Wouldn't that, in theory, be what consumer LLM users want? A mid to low range GPU with tons of VRAM?
Interesting how this method quantizes different layers / modules in a manner that minimizes perplexity as it adjusts parameters.
I'd be interested to see how 2.5-bit quantization compares to an unadjusted 4-bit baseline. Additionally, it would be interesting to see the usual benchmarks (ARC, HellaSwag, MMLU, TruthfulQA) of this method at different average bitrates (2.0, 2.5, 3.0, 4.0, ...). It would also be interesting to see if an average bitrate of 4 is just as fast and small as a constant bitrate of 4, but more accurate.
Very exciting work, looking forward to trying this out on my models!
Wow, thank you for digging that up, I suspected there might be some gain there but from the abstract it looks like I was off by a massive amount in my naive estimate of what might be achievable.
"Theoretical analysis shows that XOR-Net reduces one-third of the bit-wise operations compared with traditional binary convolution, and up to 40% of the full-precision operations compared with XNOR-Net. Experimental results show that our XOR-Net binary convolution without scaling factors achieves up to 135× speedup and consumes no more than 0.8% energy compared with parallel full-precision convolution."
The big question is whether or not the results are still as good. It would be super interesting to see whether applying this to LLMs would give comparable benefits.
Pardon me for asking a bunch of consumer-gpu questions:
1) is there any architectural difference between the 3090 and the 4090? Are there models that will run on 3 series that will not run on 4 series cards due to feature incompatibility? or is it just raw speed/power efficiency?
2) And; on raw speed. My understanding is that the 4090 is roughly 2x faster than the 3090, right? Are there any things in particular where the 4090 is much much faster (such as, big chip architecture changes that really accelerate some operations)?
3) And somewhat relatedly, if you have a 2x 3090 rig, is that about as fast as a single 4090?
4) And, if you have a 2x 3090 rig, are you able to train larger models? My understanding is that training/finetuning requires significantly more VRAM (and that is because inference can be done with quantization, whereas training must be done at full fp16 precision, right?). But can you train e.g. a model that requires 48GB of space with 2x 3090s on the same rig, or do you need a single large card with 48GB, like an A6000?
> Are there models that will run on 3 series that will not run on 4 series cards due to feature incompatibility?
No, it is all the same matrix multiplication.
> if you have a 2x 3090 rig, is that about as fast as a single 4090?
Yes, but double the VRAM
> But can you train e.g. a model that requires 48gb of space with 2 3090s on the same rig, or do you need a large card with 48gb, like an a4000?
Yes, 2x 3090 would be able to handle anything that a 48GB card can. For training you will need space for weights and gradients; if using the Adam optimizer, the optimizer state alone is 2x-4x model size. Plus activations, plus inputs times batch size.
So a 24GB card can train approximately a 3B model without compromises.
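As a rough back-of-the-envelope check on that 3B figure (the per-parameter byte counts are assumptions; mixed-precision setups vary): fp16 weights, fp16 gradients, and roughly 4 bytes of Adam state per parameter gives about 8 bytes per parameter before activations.

```python
# Rough VRAM estimate for fine-tuning with Adam. Byte counts per parameter
# are assumptions (fp16 weights + fp16 grads + ~4 bytes optimizer state);
# activations, inputs, and framework overhead come on top.
def training_vram_gb(params_billions,
                     weight_bytes=2, grad_bytes=2, optimizer_bytes=4):
    per_param = weight_bytes + grad_bytes + optimizer_bytes  # 8 bytes/param
    return params_billions * per_param  # 1e9 params * 1 byte ~= 1 GB

print(training_vram_gb(3))  # 24 -> roughly the capacity of a 24GB card
```

With fp32 optimizer state the multiplier grows toward the 4x end of the 2x-4x range quoted above, which is why anything bigger than ~3B needs offloading, LoRA, or more cards.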
This is fantastic. It's really nice to see the different speed comparisons. Does anyone know how a large model performs on some serious hardware? How much faster are the top-grade Nvidia cards than a 4090? Do models that require multiple cards to fit in memory run with the equivalent performance of the multiple cards, or are they throttled by memory access?
> Do models that require multiple cards to run in memory run with equivalent performance of the multiple cards or are they throttled by the memory access?
It depends, but some frameworks just run the layers sequentially. So you get ~50% utilization from 2 GPUs unless you pipeline requests from multiple users
But at what cost? Have there been any perplexity benchmarks for EXL2?
It's annoying that Facebook made Llama v2 70B instead of 65B, even with the memory-saving changes... 5B fewer parameters, and the squeeze would be far easier.
Good point -- hopefully the quality impact is still worth it, remains to be seen. Agree on the size -- hopefully something they will keep in mind for future models.
Can anyone provide any additional details on the EXL2[0]/GPTQ[1] quantisation, which seems to be the main reason for a speedup in this model?
I had a quick look at the paper which is _reasonably_ clear, but if anyone else has any other sources that are easy to understand, or a quick explanation to give more insight into it, I'd appreciate it.
What is the best model that would run on a 64GB M2 Max? I'm interested in attempting 70B model but not gimped to 2bit if I can avoid it. Would I bet able to run 4bit 70B?
The biggest issue with local LLMs is that "best" is extremely relative. As in, best for what application? If you want a generic LLM that does OK at everything, you can try Vicuna. If you want coding, Code Llama 34B is really good. Surprisingly, the 13B Code Llama isn't as good, but it's pretty darn close to the 34B.
Not really; a 13B 4-bit model fits in about 8GB of RAM. But you have to factor in the unified memory aspect of M-series chips: your "VRAM" and RAM come from the same pool. Your computer shares that pool to run the OS and applications, hence why you can only load a 13B 4-bit model with 16GB of RAM.
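The arithmetic behind those sizes is simple: weight memory is roughly parameter count times bits per weight divided by 8, with KV cache, runtime buffers, and the OS's share of unified memory on top.

```python
# Weights-only memory estimate. KV cache, runtime buffers, and (on Apple
# silicon) the OS's share of unified memory come on top, which is why
# 13B@4bit is a practical ceiling for a 16GB machine.
def weights_gb(params_billions, bits_per_weight):
    return params_billions * bits_per_weight / 8  # 1e9 params ~= 1 G

print(weights_gb(13, 4))  # 6.5
print(weights_gb(70, 4))  # 35.0
```

By this estimate a 4-bit 70B needs ~35GB for weights alone, which is why it fits on a 64GB M2 Max but not on a single 24GB GPU without dropping toward 2-something bpw.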
Some 70B finetunes excel at certain niches, like specific non-English languages, roleplay, fictional writing, or topics GPT4 would refuse to discuss. Some of these are hard to evaluate, but try out (for instance) MythoMax for fiction writing, Airoboros for "uncensored" general use, and Samantha for therapist style chat.
That is it ^. Search for the parameter count you want (like "70B" or "13B") and the format you desire ("GPTQ" or "GGUF")
Some users are really terrible about labeling their model cards though, and some models may not have any GPTQ/GGUF files (meaning you have to convert them yourself).
It seems to depend on the task. I'd say it beats GPT-3 in quality most of the time (but not in speed and cost!) and GPT-4 approximately never. It's perfect if what you need is "less good than GPT-4 but 50x cheaper."
In my experience the Chat version of Llama 2 is significantly worse than the GPTs, on the "I'm afraid I can't do that" front. It doesn't lecture you as much, but suggests "how about we do [something unrelated and useless] instead?"
As for the plain text prediction version (which is the only uncensored one?), I haven't been able to get it to do anything useful, even when I provide examples (seems dumber than even ancient GPT-3?).
Also, I got some bizarre and disturbing outputs from the uncensored version, like it was trained on some very nasty inputs! I assume that's why they went so hard on the safety phase to compensate...
That's absolute dirt cheap. It's like 1/10th the price fine-tuned GPT-3 was launched at, and funnily enough cheaper than many of the original embeddings endpoints.
Yes, and that's great, but what I meant is that it can definitely be much cheaper still, especially if you do not need a model that needs to be able to do everything, but is specialized on one or a few use cases.
And it can be gotten more uncensored on HuggingFace.
Not that I have evil intentions but the level of censorship on GPT is completely ridiculous. Even many innocent questions get the standard "I'm only an AI and I won't help you doing bad stuff" blurb now. OpenAI are really crazy overprotective of their darling.
I assume they want to avoid a repeat of the news headlines like "Microsoft's chatbot turns into Hitler" but really who cares. It didn't hurt Microsoft's AI efforts either. They just fixed it and continued. PS source link: https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot...
PS: If I were an AI being force-fed what is currently on twitter I would also start hating humanz :D Sometimes I'm surprised people use it voluntarily.
I don’t have quantitative data but my employer recently released an internal llama v2 at 70B FP16. The internal front end allows us to switch between different LLMs. I would say it’s very on par and sometimes better than GPT3.5 for the task I use it for. You’ll get a lot of different answers here because not many people run 70B at FP16.
Depends entirely on your application. If you want a real-time chatbot, anything more than 10 tokens/s is probably generating text faster than the user can read, so it's fine. Do you want code suggestions or completions? Probably also fine, but a bit slow. Do you want to create summaries for 1000 documents? That's gonna be really slow. But at this frontier of the field, tokens/s is not really the issue. Performance vs. quality is. If you lose a whole lot of accuracy by quantizing floats down to single-digit precisions but in turn are able to run 70B-parameter models, you often still get better results than less thoroughly quantized 7B-parameter models. It's all about getting big models to run on memory-limited GPUs.
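To put numbers on the "summaries for 1000 documents" case (the token counts here are made-up assumptions for illustration):

```python
# Hypothetical batch workload: 1000 documents, ~500 generated tokens per
# summary, decoded single-stream at 10 tokens/s.
docs, tokens_per_doc, tok_per_s = 1000, 500, 10

total_seconds = docs * tokens_per_doc / tok_per_s
print(total_seconds)         # 50000.0 seconds
print(total_seconds / 3600)  # ~13.9 hours single-stream
```

A rate that feels instant in a chat window turns into an overnight job in batch, which is where batching and throughput-oriented inference engines earn their keep.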
More than ~10 tokens/s streamed feels OK to read. I can kinda live with 4-5 tokens/s (which is what I get on llama 70B on my 3090 with llama.cpp CPU offload).
What feels "good" depends on the person and the type of content, but 35 tokens/s should feel very fast.
I wonder if anyone will end up employing the "human" solution of padding sentences out with filler words like 'Ummmm' or 'So like' while the generator is determining the actual next token...
Once the model has finished reading the prompt and starts answering, tokens are generated at a constant speed; there are no "Ummm..." gaps. But some tokens are already decided by the first layers while other tokens need the full depth, so there is a notion of "thinking harder" on particular tokens. We could detect token difficulty and fake the output to show the pauses.
Community is a chicken-and-egg problem. I started my own Lemmy server to try to contribute and hold some space for discussion, but if you just want to start a community, it's as easy as making a new account on https://lemmy.world and clicking Create a Community on the front page.
Is the LLM local hosting community big enough for Nvidia to release something equivalent to a 4070 but with 40-80 GB of VRAM? A small group of Microsoft Flight Simulator enthusiasts might also like such a GPU.
It's more that the LLM non-local hosting (and training) community (on A100/H100) is big enough that it's more profitable for Nvidia to not release such prosumer GPUs ;)
I'm having the hardest time finding it, but I read a forum post about a few people that were able to add vram to existing 40 series cards after they finally cracked signing for the bios.
We're creeping up on the time when more consumer cards will have increases in VRAM (looks like the 40-series Supers will get a bump), and within the next couple of years I wouldn't be surprised to see mid-priced cards with >= 24GB.
I know it's not true but part of me believes that meta intentionally held back the 34B to make the community work on improving 13B and 70B model usage.
I think it depends a lot on what you're wanting to do. If you're just playing, you can get surprisingly good results from a 13B model and run it on a fairly basic M1 Mac. And on an M2 Ultra with a lot of RAM you can run some seriously large models at a good speed.
I compromised. Was going to build a PC with a nice graphics card just for LLM work, but in the end decided I didn't want the extra hardware and I could be happy with a powerful laptop that is my daily driver but also capable of running a good size LLM. Ended up with an M2 Max w/96GB of RAM. I can run 70B models (quantized at 4 bits) at a usable pace. Not 35 tokens/sec like this 4090 demo, but 5 tok/sec or so, which is usable for me.
On Windows and Linux, you can run 7B models on pretty much any CPU with AVX support. Most x86 CPUs that support SSE can get usable acceleration on Llama.cpp.
Maybe a stupid question; I'm not familiar with how video cards work, but does the concept of memory swapping exist for them? Or is that physically impossible?
Obviously that would be useless for games, but for LLMs it may be an option?
If you need to use all the memory then swapping breaks down, best you can do is try to divide the work so half can be done, then you completely swap all the memory and do the other half. Also PCIe is (relatively) slow for moving gigabytes of data back and forth.
Some frameworks have tried to do this automatically, but it's extremely slow. You are better off just running whatever won't fit on the CPU (which is what llama.cpp does).