Grid Volunteer Computing For A.I.


Distributed Computing, Grid Computing

  1. Berkeley Open Infrastructure for Network Computing - BOINC
  2. Internet Archive / ArchiveTeam Warrior - the Internet Archive maintains a digital library free for everyone to access; the ArchiveTeam Warrior is a volunteer virtual appliance that helps archive at-risk websites. https://warrior.archiveteam.org/
  3. European Grid Infrastructure (EGI) - a series of projects funded by the European Commission.
  4. OurGrid - https://ourgrid.org/
  5. Petals - run large language models at home, BitTorrent-style. https://petals.dev/, https://huggingface.co/

Distributed storage network (DSN)

A 10 MB file is split so that 10 computers each hold 1 MB, like RAID striping over a wide-area network. Instead of downloading the data, you run commands where the data lives: a search request, SQL, regex, awk, or an LLM training step.
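A minimal sketch of the idea, in plain Python: stripe a file's bytes across hypothetical nodes, then run a search at each node so only match counts travel back over the network (a real DSN would also handle patterns that straddle chunk boundaries, replication, and node failure).

```python
import re

def stripe(data: bytes, n_nodes: int) -> list[bytes]:
    """Split data into n_nodes contiguous chunks (the last chunk may be shorter)."""
    size = -(-len(data) // n_nodes)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n_nodes)]

def distributed_search(chunks: list[bytes], pattern: bytes) -> list[int]:
    """Run the search where each chunk lives; only per-node hit counts come back."""
    return [len(re.findall(pattern, chunk)) for chunk in chunks]

# A ~10 MB file striped over 10 nodes, ~1 MB each.
data = b"needle " * (10 * 1024 * 1024 // 7)
chunks = stripe(data, 10)
hits = distributed_search(chunks, b"needle")
```

Reassembling the chunks in order reproduces the original file, which is what makes the RAID-striping analogy hold.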

IPFS (InterPlanetary File System): a content-addressed peer-to-peer protocol that splits files into blocks, names each block by a cryptographic hash of its contents, and distributes the blocks across the network.
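The core of content addressing can be sketched in a few lines. This is a simplified stand-in for an IPFS CID (real CIDs encode a codec and multihash, not a bare SHA-256 hex digest):

```python
import hashlib

def content_address(block: bytes) -> str:
    """Name a block by the hash of its contents (a simplified stand-in for a CID)."""
    return hashlib.sha256(block).hexdigest()

# Identical content always yields the same address, so any peer holding the
# block can serve it, and the requester can verify it on arrival.
store: dict[str, bytes] = {}
block = b"hello, distributed world"
cid = content_address(block)
store[cid] = block

assert content_address(store[cid]) == cid  # retrieval is self-verifying
```

Because the address is derived from the data, there is no trusted location to consult: any node that returns bytes hashing to the requested address has, by definition, returned the right block.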

  1. TensorFlow: TensorFlow provides the tf.distribute API for distributed training. Choose a strategy such as MirroredStrategy (single machine with multiple GPUs) or MultiWorkerMirroredStrategy (multiple machines).
  2. PyTorch: PyTorch offers the torch.distributed package for distributed training across GPUs and machines. Choose the Gloo, NCCL, or MPI backend based on your hardware and communication needs.
  3. Horovod: This open-source library simplifies distributed training across diverse platforms (CPUs, GPUs, clusters) and works with TensorFlow, PyTorch, and MXNet.
  4. Ray Tune: This tool offers distributed Hyperparameter Tuning, which can optimize your LLM training by running several trials on separate machines.
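The backends these frameworks rely on (NCCL, Gloo, MPI) all implement collectives such as all-reduce, which sums or averages gradients across workers in data-parallel training. A toy pure-Python version, for illustration only; real implementations use bandwidth-optimal ring or tree schedules:

```python
def allreduce_mean(grads_per_worker: list[list[float]]) -> list[list[float]]:
    """Average gradients element-wise across workers, then give every worker
    the same result -- the semantics of an all-reduce with a mean reduction."""
    n = len(grads_per_worker)
    summed = [sum(vals) for vals in zip(*grads_per_worker)]
    averaged = [v / n for v in summed]
    return [averaged[:] for _ in range(n)]

grads = [[1.0, 2.0], [3.0, 4.0]]  # two workers, two parameters each
result = allreduce_mean(grads)
# every worker ends up with the same averaged gradient [2.0, 3.0]
```

After the collective, every worker applies the identical averaged gradient, which keeps the model replicas in sync without any central parameter server.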

Data parallelization: Divide your training data across machines to distribute the workload and speed up training.
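A minimal sketch of the data split, assuming a rank/world_size scheme like the one PyTorch's DistributedSampler uses: each worker takes every world_size-th sample, so the shards are disjoint and together cover the whole dataset.

```python
def shard(dataset: list, rank: int, world_size: int) -> list:
    """Give worker `rank` every world_size-th sample of the dataset."""
    return dataset[rank::world_size]

data = list(range(10))
shards = [shard(data, r, 3) for r in range(3)]
# shards: [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Each worker then trains on its own shard and synchronizes gradients with the others after each step.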

Model parallelization: For even larger models, consider splitting the model itself across machines for parallel processing.
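The key point of model parallelism is that only activations cross the network, never the weights. A hypothetical two-stage sketch (the "machines" and stage functions are stand-ins, not a real API):

```python
def stage_one(x: float) -> float:
    """First half of the model, hosted on machine A (hypothetical layers)."""
    return x * 2.0

def stage_two(h: float) -> float:
    """Second half of the model, hosted on machine B."""
    return h + 1.0

def forward(x: float) -> float:
    """Only the intermediate activation `h` travels between machines."""
    h = stage_one(x)     # computed on machine A
    return stage_two(h)  # sent to and computed on machine B
```

Pipeline-parallel systems extend this by feeding micro-batches through the stages back to back, so machine A starts on the next batch while machine B finishes the current one.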
