Data Parallel Scaling

When scaling data parallel training to more GPUs, it is important to consider the impact this has on model training.

The most immediate effect is the change in effective batch size, since each data parallel GPU contributes its per-GPU batch to every optimizer step:

effective_batch_size = n_gpus * batch_size_per_gpu
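
For example, with 32 GPUs and an illustrative per-GPU batch size of 4 (neither number is a recommendation), the calculation looks like this:

```python
# Effective batch size grows linearly with the number of data parallel GPUs.
# Both values below are illustrative.
n_gpus = 32
batch_size_per_gpu = 4

effective_batch_size = n_gpus * batch_size_per_gpu
print(effective_batch_size)  # 128

# Doubling the GPU count at the same per-GPU batch size doubles it again.
print(2 * n_gpus * batch_size_per_gpu)  # 256
```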

Two common approaches to handling the larger effective batch size are as follows (both are sketched in code after the example below):

  1. Maintain the original learning rate as well as the original effective batch size

To achieve this, lower the batch size per GPU proportionally. For example, when scaling from 32 GPUs to 64, halve the batch size per GPU.

  2. Scale the original learning rate to match the new, larger effective batch size

With an increased effective batch size, there is an opportunity to raise the learning rate and take advantage of the more stable gradient estimates that larger batches provide. In general, experimentation is required to determine the optimal learning rate. In our experience, a good starting heuristic is to increase the learning rate by the square root of the ratio of the new effective batch size to the original effective batch size.

For example, when scaling from an effective batch size of 32 to 128, the suggested new learning rate can be calculated as follows:

new_learning_rate = sqrt(128/32) * original_learning_rate = 2 * original_learning_rate
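
Putting the two approaches together, the sketch below uses the GPU counts and effective batch sizes from the examples above; the per-GPU batch size of 4 and the original learning rate of 3e-4 are hypothetical placeholders, not recommendations:

```python
import math

# --- Approach 1: keep the original effective batch size and learning rate ---
# Example from the text: scaling from 32 GPUs to 64 GPUs.
original_n_gpus, new_n_gpus = 32, 64
original_batch_size_per_gpu = 4  # illustrative placeholder

original_effective_batch_size = original_n_gpus * original_batch_size_per_gpu  # 128
new_batch_size_per_gpu = original_effective_batch_size // new_n_gpus           # 2, i.e. halved

# --- Approach 2: keep the per-GPU batch size and scale the learning rate ---
# Example from the text: effective batch size grows from 32 to 128.
original_effective_batch_size = 32
new_effective_batch_size = 128
original_learning_rate = 3e-4  # hypothetical placeholder

new_learning_rate = math.sqrt(new_effective_batch_size / original_effective_batch_size) * original_learning_rate
print(new_learning_rate)  # 6e-4, i.e. 2x the original, since sqrt(128 / 32) == 2
```

In practice, the square root heuristic only gives a starting point; the new learning rate should still be validated empirically, as noted above.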
