Data Parallel Scaling

When scaling to more GPUs, it is important to consider the impact this will have on your model training.

One key consideration is the change in effective batch size: in data parallel training each GPU processes its own batch of data every step, so the effective batch size grows with the number of GPUs.

effective_batch_size = n_gpus * batch_size_per_gpu
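For example, in a PyTorch job launched with torchrun, the number of processes (one per GPU) is exposed through the WORLD_SIZE environment variable. The following minimal sketch computes the effective batch size; the per-GPU batch size value is illustrative:

import os

# torchrun sets WORLD_SIZE to the total number of processes (one per GPU).
n_gpus = int(os.environ.get("WORLD_SIZE", 1))
batch_size_per_gpu = 16  # illustrative; use whatever each per-process DataLoader is given
effective_batch_size = n_gpus * batch_size_per_gpu
print(f"{n_gpus} GPUs x {batch_size_per_gpu} per GPU = effective batch size {effective_batch_size}")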

Two common approaches to this are as follows:

  1. Maintain the original learning rate as well as the original effective batch size

To achieve this, lower the batch size per GPU in proportion to the increase in GPU count. For example, if you are scaling from 32 GPUs to 64, halve the batch size per GPU.
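As a sketch of this approach, again assuming a torchrun-style launch, the per-GPU batch size can be derived from the GPU count so the effective batch size stays fixed (target_effective_batch_size is a hypothetical name for your original value):

import os

target_effective_batch_size = 1024  # the original effective batch size to preserve
n_gpus = int(os.environ.get("WORLD_SIZE", 1))

# The target must divide evenly across the GPUs for the match to be exact.
assert target_effective_batch_size % n_gpus == 0
batch_size_per_gpu = target_effective_batch_size // n_gpus
# e.g. 32 GPUs -> 32 per GPU; 64 GPUs -> 16 per GPU.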

  2. Scale the original learning rate to the new, increased effective batch size

A larger effective batch size yields more stable gradient estimates, which creates an opportunity to increase the learning rate. In general, experimentation is required to determine the optimal new learning rate. In our experience, a good starting heuristic is to scale the learning rate by the square root of the ratio of the new effective batch size to the original effective batch size.

For example, when scaling from an effective batch size of 32 to 128, the suggested new learning rate can be calculated as follows:

new_learning_rate = sqrt(128/32) * original_learning_rate
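This heuristic can be written as a small helper. The sketch below is illustrative and not part of the ISC tooling; the function and variable names are hypothetical:

import math

def sqrt_scaled_lr(original_lr, original_effective_bs, new_effective_bs):
    # Square-root learning rate scaling heuristic.
    return original_lr * math.sqrt(new_effective_bs / original_effective_bs)

# Scaling from an effective batch size of 32 to 128: sqrt(128/32) = 2,
# so a learning rate of 1e-3 becomes 2e-3.
new_lr = sqrt_scaled_lr(1e-3, 32, 128)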
