That's a great comparison. For a real number, I just checked Runpod and you can ...

amelius · on Jan 11, 2023

It would be great if a tradeoff could be made, though. For example, train at 1/10th the speed for 1/10th of the cost.

This could correspond to taking public transport in your analogy, and would bring this within reach of most students.

londons_explore · on Jan 11, 2023

Slower training tends to be only a little cheaper, because most modern architectures parallelize well, and they just care about the number of FLOPs.

If you want to reduce cost, you need to reduce the model size, and you'll get worse results for less money.

mk_stjames · on Jan 11, 2023

The problem with that is that, currently, the available memory scales with the class of GPU, and very large language models need 160-320GB of VRAM. So, sadly, there isn't anything out there that you can load a model this large onto except a rack of 8x+ A40s/A100s.
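For a rough sense of where numbers like that come from, here is a back-of-the-envelope sketch. It assumes a 175B-parameter model (the size mentioned elsewhere in this thread), counts weights only (no activations, gradients, or optimizer state), and takes 48 GB as the per-card memory of an A40. The function names are made up for illustration.

```python
import math

# Bytes per parameter for common weight formats (weights only).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

def cards_needed(n_params: float, dtype: str, card_gb: float) -> int:
    """Minimum number of cards whose combined memory fits the weights."""
    return math.ceil(weight_memory_gb(n_params, dtype) / card_gb)

if __name__ == "__main__":
    n = 175e9  # assumed model size: 175B parameters
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(n, dtype)
        print(f"{dtype}: {gb:.0f} GB of weights, "
              f"{cards_needed(n, dtype, 48)} x 48 GB cards")
```

Training needs a good deal more than this, since gradients and optimizer state come on top of the weights, so treat these as lower bounds.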

I know there are memory channel bandwidth limits and whatnot, but I really wish there was a card out there with a 3090-sized die but with 96GB of VRAM, solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, that's fine. Having only two of them to get 192GB and still fit on a desk and draw normal power would be great.

buildbot · on Jan 11, 2023

Technically this is not true: there are a lot of techniques to shard models and store activations between layers, or even between smaller subcomponents of the network. For example, you can split the 175B parameter BLOOM model into separate layers, load up one layer, read the previous layer's output from disk, and save the new output to disk.
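To make the layer-at-a-time idea concrete, here is a toy sketch in plain Python. It is not how any particular library does it: the "layers" are tiny made-up weight matrices stored as JSON files, and `offloaded_forward` is a hypothetical helper. The point is only the access pattern, where one layer's weights are in memory at a time and activations round-trip through disk.

```python
import json
import os
import tempfile

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(v):
    return [max(0.0, a) for a in v]

def save(path, obj):
    with open(path, "w") as f:
        json.dump(obj, f)

def load(path):
    with open(path) as f:
        return json.load(f)

def offloaded_forward(layer_paths, x, workdir):
    """Forward pass that holds only one layer's weights in memory at a time."""
    act_path = os.path.join(workdir, "act.json")
    save(act_path, x)                        # input activations go to disk first
    for path in layer_paths:
        w = load(path)                       # load just this layer's weights
        h = load(act_path)                   # read the previous layer's output
        save(act_path, relu(matvec(w, h)))   # write this layer's output
        del w                                # drop the weights before the next layer
    return load(act_path)

def demo():
    # Two tiny hypothetical layers, each stored in its own file.
    layers = [[[1.0, -1.0], [0.5, 0.5]], [[2.0, 1.0], [0.0, 3.0]]]
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i, w in enumerate(layers):
            p = os.path.join(d, f"layer{i}.json")
            save(p, w)
            paths.append(p)
        return offloaded_forward(paths, [1.0, 2.0], d)

if __name__ == "__main__":
    print(demo())
```

The tradeoff is the one discussed above: peak memory drops to roughly one layer plus one set of activations, but every layer now costs a disk read and a disk write, so it is much slower.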

And NVIDIA does make cards like you are asking for - the A100 is the fast memory offering, the A40 the bulk slower memory (though they added the 80GB A100 and did not double the A40 to 96GB so this is less true now than the P40 vs P100 gen).

Oddly, you can get close to what you are asking for with an M1 Mac Studio: 128GB of decently fast memory with a GPU that is ~0.5x a 3090 in training.

sbrother · on Jan 12, 2023

Do you know if there's any work on peer-to-peer clustering of GPU resources over the internet? Imagine a few hundred people with 1-4 3080Tis each, running software that lets them form a cluster large enough to train and/or run a number of LLMs. Obviously the latency between shards would be orders of magnitude higher than a colocated cluster, but I wonder if that could be designed around?

pizza · on Jan 12, 2023

Bloom-petals

sbrother · on Jan 12, 2023

Amazing. Thank you.

pizza · on Jan 13, 2023

No prob. I think it’s a great idea

amelius · on Jan 11, 2023

I guess this would only become a reality if games started requiring these cards.

mcbuilder · on Jan 11, 2023

Well, if it used to cost you $1 for 1hr at 1x speed, now it will take you 10hr at 0.1x speed and, if my math checks out, still cost $1. You need to shrink the model.
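Spelled out as a sketch, with the assumption (mine, not a quoted price list) that the hourly rate scales in proportion to speed, and with arbitrary made-up units of work:

```python
def training_cost(total_work, work_per_hour, dollars_per_hour):
    """Return (hours, dollars) for a job of fixed total work."""
    hours = total_work / work_per_hour
    return hours, hours * dollars_per_hour

if __name__ == "__main__":
    # Same job on a full-speed machine and on one that is 10x slower
    # and, by assumption, 10x cheaper per hour.
    print(training_cost(100, 100, 1.00))   # 1x speed at $1.00/hr
    print(training_cost(100, 10, 0.10))    # 0.1x speed at $0.10/hr
    # Halving the work (a smaller model) is what actually halves the bill.
    print(training_cost(50, 100, 1.00))
```

Under that assumption, the speed cancels out and only the total amount of work moves the cost.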

amelius · on Jan 11, 2023

But of course now you run it on your own computer instead of in the DC, which changes the numbers. Especially if your student dorm has a shared electricity bill :)

willseth · on Jan 11, 2023

The good news is that, unlike vehicles, the rate for rented compute will continue to drop.