Grid Volunteer Computing For A.I.
Distributed Computing, Grid Computing
- Berkeley Open Infrastructure for Network Computing - BOINC
- Internet Archive / ArchiveTeam Warrior - the Internet Archive maintains a free library for everyone; the Warrior is a volunteer appliance that helps archive at-risk websites into it. https://warrior.archiveteam.org/
- European Grid Infrastructure (EGI) - a series of projects funded by the European Commission.
- OurGrid - https://ourgrid.org/
- Petals - run large language models at home, BitTorrent-style. https://petals.dev/, https://huggingface.co/
Distributed storage network (DSN)
Example: a 10 MB file is split so that 10 computers each hold 1 MB, like RAID striping over a wide-area network. Instead of downloading the data, you run commands over it in place: a search request, SQL, regex, awk, or even training an LLM.
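The striping idea can be sketched in a few lines. This is a single-process toy, assuming the "nodes" are just in-memory stripes and the shipped command is a regex count; the function names are made up for illustration, and note that striping at arbitrary byte offsets could split a match across two stripes (here the data lines up so it does not).

```python
import re
from concurrent.futures import ThreadPoolExecutor

def stripe(data: bytes, n_nodes: int) -> list[bytes]:
    """Split data into n roughly equal stripes, one per node (RAID-0 style)."""
    size = -(-len(data) // n_nodes)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n_nodes)]

def node_search(stripe_data: bytes, pattern: bytes) -> int:
    """Run a regex locally on one node's stripe; return only the match count."""
    return len(re.findall(pattern, stripe_data))

data = b"error ok error ok ok error " * 1000  # 27 KB "file"
stripes = stripe(data, 10)                    # 10 nodes, ~2.7 KB each

# Ship the *query* to each node instead of downloading the data,
# then combine the small per-node results.
with ThreadPoolExecutor(max_workers=10) as pool:
    counts = pool.map(node_search, stripes, [b"error"] * 10)

total = sum(counts)
```

Each node returns a single integer, so only a few bytes travel over the network instead of the whole file.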
IPFS (InterPlanetary File System): distributes files as content-addressed blocks across a peer-to-peer network; redundancy comes from peers replicating and pinning blocks rather than from built-in erasure coding.
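The core of content addressing is small enough to sketch. A minimal stand-in, assuming a plain dict for the network and a raw hex SHA-256 as the content ID (real IPFS CIDs are multihash-encoded, not bare hex):

```python
import hashlib

store: dict[str, bytes] = {}  # stands in for the peer-to-peer network

def put_block(data: bytes) -> str:
    """Store a block under the hash of its contents and return that CID."""
    cid = hashlib.sha256(data).hexdigest()
    store[cid] = data
    return cid

def get_block(cid: str) -> bytes:
    """Fetch a block; any peer can verify the bytes match the requested CID."""
    data = store[cid]
    assert hashlib.sha256(data).hexdigest() == cid, "corrupted block"
    return data

cid_a = put_block(b"hello world")
cid_b = put_block(b"hello world")  # same content -> same CID, deduplicated
```

Because the address *is* the hash, identical blocks deduplicate automatically and a receiver never has to trust the peer that served them.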
- TensorFlow: provides the tf.distribute API for distributed training. Choose a strategy such as MirroredStrategy (one machine, multiple GPUs) or MultiWorkerMirroredStrategy (multiple machines) depending on your setup.
- PyTorch: offers the torch.distributed package for distributed training across GPUs and machines. Choose between the Gloo, NCCL, or MPI backends based on your hardware and communication needs.
- Horovod: This open-source library simplifies distributed training across diverse platforms (CPUs, GPUs, clusters) and works with TensorFlow, PyTorch, and MXNet.
- Ray Tune: offers distributed hyperparameter tuning, which can optimize LLM training by running many independent trials in parallel on separate machines.
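The tuning idea in the last bullet works because trials are independent. A stdlib stand-in for what Ray Tune automates across machines, with a hypothetical toy objective (a quadratic "loss" over the learning rate, minimum at 0.01) rather than a real training run:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def trial(config):
    """Toy objective: pretend validation loss as a function of lr."""
    lr = config["lr"]
    return (lr - 0.01) ** 2, config  # minimum at lr = 0.01

def random_search(n_trials, seed=0):
    """Sample configs, run trials in parallel, return the best config."""
    rng = random.Random(seed)
    configs = [{"lr": rng.uniform(0.0001, 0.1)} for _ in range(n_trials)]
    # Threads here for simplicity; a real tuner dispatches trials to
    # separate processes or machines and collects only the scores.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(trial, configs))
    return min(results, key=lambda r: r[0])[1]

best = random_search(50)
```

Because each trial only returns a score and its config, the scheduler's coordination cost is tiny compared to the trials themselves, which is why this pattern scales across a cluster.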
Data parallelization: Divide your training data across machines to distribute the workload and speed up training.
Model parallelization: For models too large to fit in one machine's memory, split the model itself across machines so different parts are processed in parallel.
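Data parallelism from the bullet above can be sketched without any framework. A single-process simulation, assuming a one-parameter model y = w*x trained by gradient descent on mean squared error; each "worker" holds a shard and computes a local gradient, and averaging the gradients plays the role of the all-reduce step that frameworks run across GPUs or machines (model parallelism would instead split the layers of the model itself):

```python
from statistics import mean

def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return mean(2 * (w * x - y) * x for x, y in shard)

def data_parallel_step(w, shards, lr=0.1):
    """One training step: local gradients, then an averaged (all-reduce) update."""
    grads = [local_gradient(w, s) for s in shards]  # runs in parallel in practice
    return w - lr * mean(grads)

# Data follows y = 3x, split round-robin across 4 equal-sized workers.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards)  # w converges toward 3.0
```

With equal-sized shards, the average of the per-worker gradients equals the gradient over the full dataset, so the distributed update matches single-machine training exactly; only the gradient (one float here, millions of weights for an LLM) crosses the network each step.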