
The problem with that is that, currently, the available memory scales with the class of GPU, and very large language models need 160-320GB of VRAM. So, sadly, there isn't anything out there that you can load a model this large onto except a rack of 8+ A40s/A100s.

I know there are memory channel bandwidth limits and whatnot, but I really wish there was a card out there with a 3090-sized die but 96GB of VRAM, solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, that's fine. Having only two of them to get 192GB, still fit on a desk, and draw normal power would be great.
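
For context, that VRAM range follows pretty directly from parameter count and precision. A quick back-of-the-envelope (the 175B figure is just an illustrative model size; activations, KV cache, and optimizer state add more on top):

    # Weights alone for a ~175B-parameter model at different precisions.
    params = 175e9  # e.g. a BLOOM/GPT-3-class model

    for precision, bytes_per_param in [("fp16/bf16", 2), ("int8", 1)]:
        gb = params * bytes_per_param / 1e9
        print(f"{precision}: ~{gb:.0f} GB just for the weights")
    # fp16/bf16: ~350 GB just for the weights
    # int8: ~175 GB just for the weights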




Technically this is not true: there are a lot of techniques to shard models and store activations between layers, or even between smaller subcomponents of the network. For example, you can split the 175B-parameter BLOOM model into separate layers, load up a layer, read the previous layer's output from disk, and save this layer's output to disk.
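
A minimal sketch of that load-run-save loop, with plain nn.Linear layers standing in for transformer blocks (the sizes and file names are made up for illustration; real frameworks handle this for you, this is just the idea):

    import torch
    import torch.nn as nn

    HIDDEN, N_LAYERS = 1024, 4  # toy sizes, not BLOOM's real dimensions

    # Pretend each layer's weights were exported to its own file beforehand.
    for i in range(N_LAYERS):
        torch.save(nn.Linear(HIDDEN, HIDDEN).state_dict(), f"layer_{i}.pt")

    # Seed the activation file with the model input.
    torch.save(torch.randn(1, HIDDEN), "activations.pt")

    for i in range(N_LAYERS):
        layer = nn.Linear(HIDDEN, HIDDEN)                   # allocate one layer only
        layer.load_state_dict(torch.load(f"layer_{i}.pt"))  # load just its weights
        x = torch.load("activations.pt")                    # previous layer's output
        with torch.no_grad():
            y = layer(x)                                    # run this layer alone
        torch.save(y, "activations.pt")                     # hand off to the next layer
        del layer                                           # free memory before the next load

    print(torch.load("activations.pt").shape)  # torch.Size([1, 1024])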

And NVIDIA does make cards like you're asking for: the A100 is the fast-memory offering, the A40 the bulk, slower memory (though they added the 80GB A100 and did not double the A40 to 96GB, so this is less true now than in the P40 vs. P100 generation).

Oddly, you can get close to what you're asking for with an M1 Mac Studio: 128GB of decently fast memory with a GPU that is roughly half a 3090 in training speed.


Do you know if there's any work on peer-to-peer clustering of GPU resources over the internet? Imagine a few hundred people with 1-4 3080 Tis each, running software that lets them form a cluster large enough to train and/or run a number of LLMs. Obviously the latency between shards would be orders of magnitude higher than in a colocated cluster, but I wonder if that could be designed around?


BLOOM with Petals


Amazing. Thank you.


No prob. I think it’s a great idea


I guess this would only become a reality if games started requiring these cards.




