We're building an end-to-end multimodal mixture-of-experts (MoE) model that trains with 3D parallelism and is pre-trained in a decentralized way, as proposed in the DiLoCo paper.
If you want to contribute, please check the following links: