My only problem with Stable Horde is that their anti-CP measure involves checking the prompt for words like "small", meaning I can't use an NSFW-capable model with certain prompts (holding a very small bag, etc.). That, and seeing great things in the image ratings but being unable to reproduce them because it doesn't provide the prompt.
There absolutely are! Check out hivemind (https://github.com/learning-at-home/hivemind), a general library for deep learning over the Internet, or Petals (https://petals.ml/), a system built on hivemind that lets you run BLOOM-176B (or other large language models) distributed across many volunteer PCs. You can join it and host some layers of the model by running literally one command on a Linux machine with Docker and a recent enough GPU.
Disclaimer: I work on these projects; both are based on our research over the past three years.
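To give a flavor of what joining a collaborative run looks like with the hivemind library, here is a minimal sketch based on its quickstart. Treat the exact argument names and values as approximations; they may differ between hivemind versions.

```python
import torch
import hivemind

# A tiny model and an ordinary local PyTorch optimizer.
model = torch.nn.Linear(784, 10)
base_opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Connect to the distributed hash table peers use to find each other.
# To join an existing run, pass initial_peers=[...] with addresses of known peers.
dht = hivemind.DHT(start=True)

# Wrap the local optimizer so gradients/parameters get averaged with other peers.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="demo_run",        # all peers with the same run_id train together
    batch_size_per_step=32,   # samples processed per local opt.step()
    target_batch_size=10000,  # collective batch size before peers average
    optimizer=base_opt,
    use_local_updates=True,   # apply local steps, average parameters in background
    verbose=True,
)
```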
Not directly related, but the Learning@home [1] project aims to achieve precisely that goal of public, volunteer-trained neural networks. The idea is that you can host separate "experts," or parts of your model (akin to Google's recent Switch Transformers paper) on separate computers.
This way, you never have to synchronize the weights of the entire model across the participants — you only need to send the gradients/activations to a set of peers. Slow connections are mitigated with asynchronous SGD and unreliable/disconnected experts can be discarded, which makes it more suitable for Internet-like networks.
Disclaimer: I work on this project. We're currently implementing a prototype, but it's not yet GPT-3 sized. Some issues like LR scheduling (crucial for Transformer convergence) and shared parameter averaging (for gating etc.) are tricky to implement for decentralized training over the Internet.
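To make the "experts on separate machines" idea concrete, here is a toy, single-process sketch of the routing logic (a hypothetical illustration, not the project's actual API): a gating network picks the top-k experts per sample, and only the selected activations would need to cross the network to the peers hosting those experts.

```python
import torch
import torch.nn as nn

class ToyDistributedMoE(nn.Module):
    """Conceptual sketch: each expert would live on a different volunteer machine;
    here they are plain local modules standing in for remote calls."""
    def __init__(self, dim=256, num_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # gating network (shared, needs averaging)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.k = k

    def forward(self, x):
        scores = self.gate(x)                        # (batch, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # top-k experts per sample
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                # In a decentralized setup this would be a network call to the peer
                # hosting expert `e`; only the selected activations cross the wire,
                # never the full model weights.
                out[mask] += weights[mask, slot].unsqueeze(1) * self.experts[int(e)](x[mask])
        return out
```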
Your project looks so interesting.
Have you thought of putting the experts on a distributed market where their expertise and work can be exchanged for some token (obviously using a blockchain)?
This would encourage people to host experts in your network and would create value.
Thank you! This is definitely something we should look into in the future (hopefully with community help); as of now, training infrastructure and model convergence are the highest priorities. That said, we welcome all ideas for motivating more volunteers to join the experiments, because the Learning@home team comes from a distributed-DL background with limited volunteer-computing expertise.
Also, I believe that for some projects (e.g. GPT-3 replication effort) people would want to join the network regardless of the incentive mechanism, as demonstrated by Leela Chess Zero [1].
How do you deal with adversarial/byzantine updates that attempt to degrade performance or even install a backdoor? Do you use plain averaging, or some other aggregation algorithm like Multi-Krum?
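For reference, Multi-Krum (Blanchard et al., 2017) scores each update by the sum of squared distances to its closest neighbours and averages only the lowest-scoring ones. A rough PyTorch sketch (the function and parameter names here are mine):

```python
import torch

def multi_krum(updates, num_byzantine, num_selected):
    """Multi-Krum sketch: average the updates whose summed distance
    to their closest neighbours is smallest."""
    stacked = torch.stack(updates)                 # (n, d) flattened updates, one per worker
    n = stacked.shape[0]
    sq_dists = torch.cdist(stacked, stacked) ** 2  # pairwise squared distances
    num_neighbours = n - num_byzantine - 2         # neighbours used for scoring
    scores = []
    for i in range(n):
        others = torch.cat([sq_dists[i, :i], sq_dists[i, i + 1:]])  # drop self-distance
        scores.append(others.topk(num_neighbours, largest=False).values.sum())
    chosen = torch.stack(scores).topk(num_selected, largest=False).indices
    return stacked[chosen].mean(dim=0)
```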
For now, the only separation we have is that each worker is responsible for its own weights, since network security has not been our top priority. Still, we've been thinking about adding some security measures like proof-of-work for each node and detection of anomalous inputs/gradients (or simply NaN values). Right now we're running experiments on internal hardware, but before a public launch we'll make sure that malicious participants can't render everybody else's work useless :)
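A minimal version of that gradient sanity check might look like this (a hypothetical helper, not part of the current codebase):

```python
import torch

def sane_gradient(grad, max_norm=1e3):
    """Reject obviously broken updates before aggregation:
    NaN/Inf values or an absurdly large norm."""
    if not bool(torch.isfinite(grad).all()):
        return False
    return float(grad.norm()) <= max_norm

# incoming = {peer_id: flattened_gradient_tensor}
# accepted = {p: g for p, g in incoming.items() if sane_gradient(g)}
```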
This is also what I was thinking about. Since making up bad data doesn't require any GPU work, unlike honest computation, the model could degrade quickly without some measures against adversarial nodes.
A draft solution would be for the central server to measure the quality of each update and drop the ones that don't perform well. This could work because inference is much cheaper than computing gradients.
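As a rough illustration of that idea (all names here are hypothetical), the server could apply each proposed update to a throwaway copy of the model and keep only those that don't hurt a held-out validation batch:

```python
import copy
import torch

def keep_helpful_updates(model, updates, val_batch, loss_fn, lr=0.1):
    """Hypothetical server-side filter: try each proposed update on a copy
    of the model and keep it only if validation loss does not get worse."""
    x, y = val_batch
    with torch.no_grad():
        baseline = loss_fn(model(x), y).item()
    accepted = []
    for update in updates:  # update: dict {param_name: gradient tensor}
        trial = copy.deepcopy(model)
        with torch.no_grad():
            for name, param in trial.named_parameters():
                param -= lr * update[name]  # one SGD step with the proposed gradient
            if loss_fn(trial(x), y).item() <= baseline:
                accepted.append(update)
    return accepted
```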
Here is a recent paper (disclaimer: I am the first author) named "Learning@home" that proposes something along these lines. Basically, we develop a system that lets you train a network with thousands of "experts" distributed across hundreds or more consumer-grade PCs. You don't have to fit 700 GB of parameters on a single machine, and there is significantly less network delay than with synchronous model-parallel training. The only thing you sacrifice is the guarantee that every batch will be processed by all required experts.