Training Transformers Together
large-scale deep learning for everyone, by everyone
A NeurIPS 2021 Demonstration
There was a time when you could comfortably train state-of-the-art vision and language models at home on your workstation. The first convolutional neural net to beat ImageNet (AlexNet) was trained for 5-6 days on two gamer-grade GPUs. In contrast, today's Top-1 ImageNet model (CoAtNet) takes 20,000 TPU-v3 days. And things are even worse in the NLP world: training GPT‑3 on a top-tier server with 8x A100 would take decades.
So, can individual researchers and small labs still train state-of-the-art models? Yes we can! All it takes is for a bunch of us to come together. In fact, we're doing it right now and you are invited to join!
In this demo, we train a model similar to OpenAI DALL-E — a Transformer model that generates images from text descriptions. It is trained on LAION-400M, the world's largest openly available image-text-pair dataset with 400 million samples. Our model is based on the dalle‑pytorch implementation by Phil Wang with a few tweaks to make it communication-efficient.
Modern distributed training algorithms are designed for HPC clusters with a 10-100 gigabit per second bandwidth. In turn, a typical Internet connection runs at 10-100 megabits per second: that’s three orders of magnitude slower. To make distributed training efficient, you need to win back these three orders of magnitude. This may seem daunting at first, but in reality, DL researchers have already made all the necessary pieces for solving this puzzle:
|Speed‑up||How to achieve|
|4-16x||Large-batch training: You et al. (2019) proposed a way for training neural networks efficiently with larger batches, and hence, fewer communication rounds.|
|4-32x||Gradient compression: from simple 8-bit quantization to advanced techniques such as Deep Gradient Compression, PowerSGD, 1-bit Adam, and many others. As a rule of thumb, these techniques can safely reduce communication by 16-32x. More extreme compression is often possible, but it may affect stability or final quality.|
|4-24x||Parameter sharing: reusing parameters between model layers results in a model with fewer parameters, and hence, fewer gradients to communicate. Lan et al. (2019) and Xue et al. (2021) propose efficient parameter sharing architectures for NLP and computer vision.|
|1.5-2x||Overlapping computation with communication: running network communication in background while computing the next portion of gradients. This is a long-standing trick from HPC that was recently adapted for DL training. Ren et al. (2021) show that updating parameters in background while computing the next batch of gradients does not harm convergence.|
These techniques are already more than enough to cover 1000x slower communication. This means that in practice you can pick and choose which of them you want in your training run. For this demo, we use 8x larger batches, 4x compression, 12x parameter sharing and partial overlapping. If you don’t want parameter sharing, you can trade it for more advanced gradient compression or larger batches.
Most distributed DL frameworks assume that the computation is performed by a fleet of identical devices, typically GPU servers or TPU cores. Under this assumption, each device can be assigned an equal part of computation, such as processing a fixed batch size of training samples. However, this quickly breaks down if workers use different device types. If one participant uses a GPU (e.g. P100) and another runs on TPU-v2-8, it is difficult to find a regime where both devices will be fully utilized.
To make the best use of all available devices, we let each device accumulate gradients at its own pace with individually tuned batch size and some other features (e.g. gradient checkpointing or using XLA). Once workers collectively aggregate some predefined global batch size, they average their gradients with weights proportional to each worker's individual contribution (i.e. number of samples processed).
This technique allows the "swarm" to automatically adjust its behavior as peers join, leave or fail. For instance, if several high-performance peers join the experiment, other peers will need to process a smaller number of samples per optimizer step, and hence, the collaboration will train faster with the same hyperparameters. In turn, if one of the workers fails and loses its progress (e.g. due to a fp16 overflow), others will make up for that by processing slightly more. For more details on how this works, please refer to "Deep Learning In Open Collaborations" paper or the corresponding blog post.
This section will be updated on December 7.