Pooling resources a la SETI@home would be an interesting option I would love to see.



My understanding is that this can work for model inference but not for model training.

https://github.com/bigscience-workshop/petals is a project that does this kind of thing for running inference - I tried it out in Google Colab and it seemed to work pretty well.
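
Using it is only a few lines. This is roughly the example from the project's README (the model names and exact API have changed over time, so treat it as a sketch rather than gospel):

    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    # Pick a model that peers on the public swarm are currently hosting.
    model_name = "petals-team/StableBeluga2"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The heavy transformer blocks run on other people's GPUs across the
    # internet; only a small part of the model lives locally.
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=5)
    print(tokenizer.decode(outputs[0]))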

Model training is much harder though, because it requires a HUGE amount of high bandwidth data exchange between the machines doing the training - way more than is feasible to send over anything other than a local network connection.
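
To put rough numbers on it, here's a back-of-envelope sketch - the 7B parameter count, 16-bit gradients, and step count are all assumptions, and it assumes plain data parallelism where every worker exchanges a full gradient copy every step:

    # Back-of-envelope: data each worker must exchange when training with
    # naive data parallelism (all numbers are illustrative assumptions).
    params = 7e9            # assumed model size: 7B parameters
    bytes_per_value = 2     # fp16/bf16 gradients
    grad_bytes = params * bytes_per_value
    print(f"gradients per step: {grad_bytes / 1e9:.0f} GB")     # ~14 GB

    steps = 100_000         # assumed length of the training run
    total = grad_bytes * steps
    print(f"per worker, per run: {total / 1e12:.0f} TB")        # ~1400 TB

And that exchange is synchronous - every worker has to receive everyone else's gradients before any of them can start the next step.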


And a lot of expensive data scientists.


This is the type of task where, if you want to pool resources, it would be more efficient to pool dollars and buy compute power rather than pool compute power - I'd assume that even if you treat the decentralized hardware as free, just the extra electricity cost of using it is more expensive than renting a centralized server which can do the job efficiently.


SETI@home (and similar projects) fall into the domain of embarrassingly parallel problems ( https://en.wikipedia.org/wiki/Embarrassingly_parallel ).

My own experience with this was a distributed ray tracer where the server sent the full model to the machines, and then each machine would ask for one scan line to do, report back, and then ask for another scan line, repeating until the image was done.

There was no interaction between the machines - what was on one scan line didn't need any coordination with what was on another scan line.

Likewise, with SETI@home, the server could give you a chunk of data and you could analyze that chunk - the contents of another chunk of data didn't change the analysis being done on this one.

Furthermore, these can be done asynchronously and then assembled when everything is done. Only the very final product / analysis / artifact needs all of the data and nothing other than the end process is waiting on any sub process.
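
The whole pattern is a few lines of code. A minimal sketch of that scanline work queue (trace_pixel is a placeholder for the real rendering work):

    from multiprocessing import Pool

    WIDTH, HEIGHT = 640, 480

    def trace_pixel(x, y):
        # Placeholder for actually ray tracing one pixel.
        return (x * y) % 256

    def render_scanline(y):
        # Each scanline needs only the scene model (sent once up front)
        # and its own y coordinate - nothing from any other scanline.
        return [trace_pixel(x, y) for x in range(WIDTH)]

    if __name__ == "__main__":
        with Pool() as pool:
            # Workers pull the next scanline as soon as they finish one;
            # results are only assembled at the very end.
            image = pool.map(render_scanline, range(HEIGHT))

Swap the pool of local processes for machines on a network and you have the SETI@home model: a server handing out independent chunks and collecting results whenever they come back.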

For doing gradient descent ( https://www.3blue1brown.com/lessons/gradient-descent ), as I understand it, each iteration is dependent on the previous one.
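
You can see the dependency even in a toy example (1-D loss, made-up learning rate):

    # Minimize f(w) = (w - 3)^2 with gradient descent.
    w = 0.0
    lr = 0.1
    for step in range(100):
        grad = 2 * (w - 3)    # gradient evaluated at the *current* w
        w = w - lr * grad     # step N+1 cannot start until step N has produced w
    print(w)                  # approaches 3.0

There's no way to hand step 500 to one volunteer machine and step 501 to another - each update needs the weights produced by the one before it.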

Doing the matrix math over those 13,002 parameters (for the example of a 784 -> 16 -> 16 -> 10 neuron net digit recognizer on the 3b1b page) is the parallel part... and once you get into the billions of parameters it gets much larger. But matrix multiplication is hard to distribute across a network. For example - http://www.lac.inpe.br/~stephan/CAP-372/Fox_example.pdf and http://www.cs.csi.cuny.edu/~gu/teaching/courses/csc76010/sli...

> We are now ready for the second stage. In this stage, we broadcast the next column (mod n) of A across the processes and shift-up (mod n) the B values.

That use of "broadcast" is the catch - the matrix multiplication is limited by the speed of the slowest node, and at every stage it needs to send the data from the previous calculation to all the nodes, which makes it a poor fit for a network that experiences latency.
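
Here's a single-process simulation of that broadcast-and-shift pattern (Fox's algorithm on an n x n grid of blocks - the "broadcast" and "shift" are just array copies here, but on a real cluster each one is a network transfer, every stage, every step):

    import numpy as np

    n, bs = 3, 2    # 3x3 grid of "processes", each owning a 2x2 block
    A = np.random.rand(n * bs, n * bs)
    B = np.random.rand(n * bs, n * bs)

    # The blocks each process (i, j) would own.
    Ab = [[A[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(n)] for i in range(n)]
    Bb = [[B[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(n)] for i in range(n)]
    Cb = [[np.zeros((bs, bs)) for _ in range(n)] for _ in range(n)]

    for stage in range(n):
        for i in range(n):
            # "Broadcast": every process in row i receives block A[i][(i+stage) mod n].
            a = Ab[i][(i + stage) % n]
            for j in range(n):
                Cb[i][j] += a @ Bb[i][j]
        # "Shift-up": every column rotates its B blocks up by one - more traffic.
        Bb = [[Bb[(i + 1) % n][j] for j in range(n)] for i in range(n)]

    assert np.allclose(np.block(Cb), A @ B)   # matches the plain matmul

Every stage, every process waits on a broadcast and a shift before it can do its next multiply, so the wall-clock time is set by the slowest link and the slowest node.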

When doing ML training, you need on the order of TB/sec of bandwidth... and the high-end extremes are in PB/sec ( https://www.cerebras.net/product-chip/ )... and I'm sitting here watching Steam download.
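
For a sense of scale, here's how long moving one ~14 GB gradient/weight buffer takes over different links (illustrative numbers - the buffer size, the ~1 Gbit/s home connection, and the ~900 GB/s NVLink figure are all rough assumptions):

    buffer_bytes = 14e9    # assumed: one full set of fp16 gradients for a 7B model

    links = {
        "home internet (~1 Gbit/s)": 1e9 / 8,         # bytes per second
        "NVLink inside one server (~900 GB/s)": 900e9,
        "wafer-scale fabric (PB/s class)": 1e15,       # per the Cerebras link above
    }

    for name, bytes_per_sec in links.items():
        print(f"{name}: {buffer_bytes / bytes_per_sec:.3g} s")
    # ~112 s over the internet vs. ~16 ms inside the box - and training
    # wants to do a transfer like this every single step.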

The inefficiencies of the network, slow computers, and the amount of data that must be transferred to perform the next calculation make network-distributed machine learning "not a good choice" at this time.



