# Data Parallel Scaling

Scaling to more GPUs can change how your model trains. The most important effect to consider is the change in effective batch size, which grows with the number of GPUs unless you adjust the batch size per GPU.

`effective_batch_size = n_gpus * batch_size_per_gpu`
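As a minimal sketch of the formula above, the following Python helper computes the effective batch size for a given GPU count. The function name and the example values (16 samples per GPU) are illustrative assumptions, not part of any particular library.

```python
def effective_batch_size(n_gpus: int, batch_size_per_gpu: int) -> int:
    """Number of samples contributing to each optimizer step across all GPUs."""
    return n_gpus * batch_size_per_gpu


# Doubling the GPU count at a fixed per-GPU batch size doubles the effective batch size.
before = effective_batch_size(n_gpus=32, batch_size_per_gpu=16)
after = effective_batch_size(n_gpus=64, batch_size_per_gpu=16)
print(before, after)  # 512 1024
```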

There are two common ways to handle this change:

1. **Maintain the original learning rate as well as the original effective batch size**

To achieve this, reduce the batch size per GPU in proportion to the increase in GPU count. For example, when scaling from 32 GPUs to 64, halve the batch size per GPU.

2. **Scale the original learning rate to the new increased effective batch size**

A larger effective batch size gives a less noisy gradient estimate, which creates an opportunity to increase the learning rate. In general, experimentation is required to find the best new learning rate. In our experience, a good starting heuristic is to multiply the learning rate by the square root of the ratio of the new effective batch size to the original effective batch size.

For example, when scaling from an effective batch size of 32 to 128, the ratio is 4, so the suggested new learning rate is twice the original:

`new_learning_rate = sqrt(128/32) * original_learning_rate`
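Both approaches above can be sketched in a few lines of Python. The function names and the starting values (16 samples per GPU, a learning rate of `1e-3`) are hypothetical and chosen only for illustration. The square-root rule is a starting heuristic, so treat the result as an initial guess to tune rather than a final setting.

```python
import math


def per_gpu_batch_size_for_same_effective(
    original_n_gpus: int, original_batch_size_per_gpu: int, new_n_gpus: int
) -> int:
    """Approach 1: keep the effective batch size fixed by shrinking the per-GPU batch size."""
    effective = original_n_gpus * original_batch_size_per_gpu
    if effective % new_n_gpus != 0:
        raise ValueError(
            f"Effective batch size {effective} does not divide evenly across {new_n_gpus} GPUs"
        )
    return effective // new_n_gpus


def sqrt_scaled_learning_rate(
    original_learning_rate: float,
    original_effective_batch_size: int,
    new_effective_batch_size: int,
) -> float:
    """Approach 2: scale the learning rate by the square root of the batch size ratio."""
    ratio = new_effective_batch_size / original_effective_batch_size
    return math.sqrt(ratio) * original_learning_rate


# Approach 1: scaling from 32 to 64 GPUs halves the per-GPU batch size (16 -> 8).
new_per_gpu = per_gpu_batch_size_for_same_effective(32, 16, 64)

# Approach 2: effective batch size 32 -> 128 gives sqrt(4) = 2x the learning rate.
new_lr = sqrt_scaled_learning_rate(1e-3, 32, 128)
print(new_per_gpu, new_lr)
```

The first helper raises an error when the original effective batch size cannot be split evenly across the new GPU count, since in that case the effective batch size cannot be preserved exactly.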

