Show HN: I made a GPU VRAM calculator for transformer-based models

ilaksh · on Dec 26, 2023

Does this have an option for quantization levels? Don't think I saw it.

furiousteabag · on Dec 27, 2023

There is no option to select quantized version yet. Will work on that!

ComputerGuru · on Dec 26, 2023

I second the request for quantization, eg for exl2.

samspenc · on Dec 26, 2023

Consumer grade GPUs like NVidia's 3090 and 4090 max out at 24 GB VRAM, and those cost $1000-2000 each. You can get higher VRAM but need enterprise GPUs which are in the five figures, easily starting at $30K a pop.

Per this calculator, for training, only gpt2-large and gpt2-medium would work with those two top-of-the-line GPUs.

For inference it's certainly a bit better, only the Llama-2-70b-hf and Llama-2-13b-hf don't fit in that much VRAM, all the other models do.

alexhutcheson · on Dec 27, 2023

Nvidia’s workstation cards are available with more RAM than the consumer cards, at a lower price than the datacenter cards. RTX 6000 Ada has 48 GB VRAM and retails for $6800, and RTX 5000 Ada has 32 GB VRAM and retails for $4000[1].

Very large models have to be distributed across multiple GPUs though, even if you’re using datacenter chips like H100s.

[1] https://store.nvidia.com/en-us/nvidia-rtx/store/

slabity · on Dec 27, 2023

Other than power consumption, is there any reason to prefer a single workstation card over multiple consumer cards then?

A single $6800 RTX 6000 Ada with 48GB of VRAM vs 6x 7900XTX with a combined total of 144GB of VRAM honestly makes this seem like a no brainer to me.

alexhutcheson · on Dec 27, 2023

You can only fit 1-2 graphics cards in a “normal” ATX case (each card takes 2-3 “slots”). If you want 4 cards on one machine, you need a bigger/more expensive motherboard, case, PSU, etc. I haven’t personally seen anyone put 6 cards in a workstation.

oceanplexian · on Dec 27, 2023

In a water cooled config the cards only take 1 slot. I’ve got 2 3090s and am buying another two shortly. Preemtively upgraded the power to 220v, found a 2kw PSU, and installed a dedicated mini split. I’m also undervolting the cards to keep power and heat down, because even 2000w is not enough to run 4 and a server grade CPU without tripping. When you start accumulating GPUs you also run into all kinds of thermal and power problems for the room, too.

kkielhofner · on Dec 27, 2023

This is impressive.

I was fortunate enough to scoop up a bunch of Gigabyte RTX 3090 Turbos. Cheap used eight slot SuperMicro (or whatever), a cabling kit, four 3090s, boot.

Those were the days!

alexhutcheson · on Dec 27, 2023

Sincere question: Is installing and running a mini split actually cheaper than racking them in a colo, or paying for time on one of the GPU cloud providers?

Regardless, I can understand the hobby value of running that kind of rig at home.

oceanplexian · on Jan 3, 2024

I personally haven’t done the calculation. I have rented colo space before and they are usually quite stingy on power. The other issue is, there’s a certain element to having GPUs around 24/7/365 to play with that I feel is fundamentally different than running it on a Cloud Provider. You’re not stressing out about every hour it’s running. I think in the long run (2yr+) it will be cheaper, and then you can swap in the latest and greatest GPU without any additional infrastructure cost.

ttt3ts · on Dec 27, 2023

You have to pass the context between GPUs for large models that don't fit in VRAM. Often ends up slower. Also, tooling around AMD GPUs is still poor in comparison.

mciancia · on Dec 27, 2023

Used 3090 are going for ~600usd these days (at least in Europe) thanks to crypto mining crash - building a workstation with 2 of these is fairly easy for 48GB of vram, with 4 a bit more tricky but still doable and affordable IMO

ngoro7bd · on Dec 27, 2023

Recently bought 24GB 3090 for $700 in US. Used but never tweaked, runs stable for 6 months despite heavy workloads.

nVidias play seems obvious. Game graphics don’t move that fast these days. Used market flush with 3090s and down is fine to them while they focus on extracting top dollar from fast moving AI researchers/VCs

namibj · on Dec 27, 2023

You can easily use pipeline-parallelism though. Especially if you have 8-16 lanes of PCIe4 with direct P2P access between the cards.

IIRC you want micro-batching though, to overlap pipeline phases.

nox100 · on Dec 27, 2023

I haven't a clue how they compare but a Studio Mac with an M2 Ultra can get 192GB of unified ram for $5700 (PS: not a mac fan, a curious)

cchance · on Dec 26, 2023

Very nice, would be cool to have a little i next to each spot to explain what each thing is for newer users (batch size, etc)

roseway4 · on Dec 26, 2023

While not as pretty (and mobile-friendly) as the original link, the calculators below support modeling LoRA-based training, alongside full finetuning.

https://huggingface.co/spaces/Vokturz/can-it-run-llm

https://rahulschand.github.io/gpu_poor/

ComputerGuru · on Dec 26, 2023

They seem to be broken when I try any HF ids besides what came preconfigured. e.g. just tried brucethemoose/Yi-34B-200K-DARE-merge-v5-3.1bpw-exl2-fiction or LoneStriker/shisa-7b-v1-3.0bpw-h6-exl2

3abiton · on Dec 27, 2023

Been looking for something like thos for a while! I googled a lot, and this link never popped up. I feel google search is regressing.

icelancer · on Dec 26, 2023

Second link hasn't been working for awhile.

a2128 · on Dec 26, 2023

I noticed the default parameter count value is 1.418 billion but if you erase it you can't actually enter it back because you can't type a decimal point in the input area. Also, you can't enter parameter counts smaller than 1 billion

sp332 · on Dec 26, 2023

It works if you type the digits first and then insert the decimal point after.

lgkk · on Dec 29, 2023

On mobile, iOS specifically in safari, your drop downs are hard to use. I’m not able to dismiss the keyboard. Is that an issue on my end?

thatguysaguy · on Dec 27, 2023

This only lists first moments, but Adam stores estimates of first and second moments.

furiousteabag · on Dec 27, 2023

By default, SGD w momentum is enabled as optimizer. You may try selecting Adam and it will list second moments as well.

_giorgio_ · on Dec 27, 2023

What is the usual way to do it inside the python file that defined the model?

twayt · on Dec 27, 2023

This is actually pretty useful

a_wild_dandan · on Dec 26, 2023

Are people still rawdoggin' 16-bit models? I almost exclusively use 5-bit inference quants (or 8-bit natives like Yi-34b) on my MacBook Pro. Tiny accuracy loss, runs fast, and leave plenty of (V)RAM on the table. Mixtral 8x7 is my new daily driver, and only takes like 40GB to run! I wonder if I could run two of them talking to each other...

rubatuga · on Dec 27, 2023

Pure 16bit is horrible for training, sorry.

rdedev · on Dec 27, 2023

Doesn't using bf16 alleviate the problem? At least I've had success training a Bert like model from scratch

furiousteabag · on Dec 27, 2023

Mixed precision is a default method to pretrain and full fine tune right now. It is especially good in transformers, because they have memory bottleneck in activations (outputs of intermediate layers stored for backprop), and running forward pass in fp16/bf16 reduces VRAM by almost half (speeds up forward pass as well).

shikon7 · on Dec 27, 2023

I wonder about that too. With the small precision, parameter updates might be too small to have an effect (is it possible to use some sort of probabilistic update in that case?) Unfortunately, I haven’t found any resources describing the feasibility of full fp16 or bf16 training.

furiousteabag · on Dec 27, 2023

You are correct, training sorely in fp16/bf16 can lead to imprecise weight updates or even gradients turning to zero. Because of that, mixed precision is used. In mixed precision training, we keep a copy of the weights in fp32 (master model) and the training loop looks like this: compute the output with the fp16 model, then the loss -> back-propagate the gradients in half-precision -> copy the gradients in fp32 precision -> do the update on the master model (in fp32 precision) -> copy the master model in the fp16 model. We also do loss scaling which means multiplying the output of the loss function by some scalar number before backprop (necessary in fp16 but not required in bf16).

Check out the fastai docs for more details: https://docs.fast.ai/callback.fp16.html

rdedev · on Dec 27, 2023

Ah my bad. I am using mixed precision training in the my previous comment.

You might find this paper interesting: https://arxiv.org/pdf/2010.06192.pdf

bigdict · on Dec 27, 2023

Hmm, what do you mean? I thought bf16 is used extensively for LLM training.

chrsig · on Dec 27, 2023

How does one rawdog a 16-bit model?

kkzz99 · on Dec 27, 2023

Usually, for efficiency, you use quantized models. Quantized models reduce the number of bits available for each parameter, saving space and reduce RAM usage.