The Objective Dad
committed on
Adding NCCL Timeout Guide (#536)
* fixes NCCL_P2P_LEVEL=NVL #429
* adding more insights into various values of NCCL_P2P_LEVEL
- README.md +4 -0
- docs/nccl.md +46 -0
README.md
CHANGED

@@ -752,6 +752,10 @@ Try to turn off xformers.
 
 It's safe to ignore it.
 
+> NCCL Timeouts during training
+
+See the [NCCL](docs/nccl.md) guide.
+
 ## Need help? 🙋‍♂️
 
 Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you

docs/nccl.md
ADDED

# NCCL

NVIDIA NCCL is a library to facilitate and optimize multi-GPU communication operations such as broadcast, all-gather, reduce, and all-reduce. Broadly, NCCL configuration is highly environment-specific and is configured via several [environment variables](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html). A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort:

```text
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.
```

Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised. NVIDIA recommends [disabling PCI access control services (ACS)](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs) as a possible solution if this is available to you.

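Before changing BIOS or firmware settings, it can help to confirm whether ACS is actually enabled on your PCIe bridges. Below is a minimal check sketch, assuming a Linux host with `pciutils` installed (the full capability dump generally requires root):

```shell
# Dump PCI capabilities and show the ACS control lines; flags such as
# "SrcValid+" suggest ACS is enabled on that bridge. Disable it via your
# BIOS/BMC or as described in the NVIDIA troubleshooting guide linked above.
sudo lspci -vvv | grep -i "ACSCtl"
```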
Forcing cross-GPU communication via [NVLink](https://en.wikipedia.org/wiki/NVLink) may help without increasing timeouts. To verify that your configuration is leveraging NVLink, run the following command:

```shell
nvidia-smi nvlink --status
```

To force NCCL to use NVLink, set this in the environment:

```shell
export NCCL_P2P_LEVEL=NVL
```

If NVLink is not available in your environment, there are other options for ``NCCL_P2P_LEVEL`` in the table below:

| NCCL_P2P_LEVEL | Description |
| -------------- | ----------- |
| PIX | P2P data transfers go through no more than a single PCIe bridge. Faster than paths involving multiple bridges, but slower than direct GPU-to-GPU communication. |
| PXB | P2P data transfers go through multiple PCIe bridges without crossing the PCIe Host Bridge; this path involves more complex routing and can incur moderate latency. |
| PHB | P2P data transfers go over PCIe and through a PCIe Host Bridge, typically involving the CPU; this can facilitate direct memory access but may add latency compared to more direct paths (e.g. PIX, NVL). |

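The path names in this table match the interconnect labels that `nvidia-smi` reports for each GPU pair, so inspecting the topology first shows which setting is realistic on your machine. A small sketch; the `NV#`, `PIX`, `PXB`, `PHB`, and `SYS` labels in the output come from `nvidia-smi` itself:

```shell
# Print the GPU interconnect matrix. NV# entries mean the pair is connected
# via NVLink; PIX/PXB/PHB/SYS describe the PCIe path used instead.
nvidia-smi topo -m
```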
To validate that acceptable data transfer speeds exist for your training job, running [NCCL Tests](https://github.com/NVIDIA/nccl-tests/blob/master/README.md) can help pinpoint bottlenecks, for example:

```shell
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
```

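The `all_reduce_perf` binary above ships with the NCCL Tests repository and has to be built first. A minimal build sketch, assuming CUDA and NCCL are installed in their default locations (otherwise pass `CUDA_HOME`/`NCCL_HOME` to `make`, per the nccl-tests README):

```shell
# Fetch and compile the NCCL test binaries; this produces
# ./build/all_reduce_perf among others.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
```

Adjust the `-g` flag in the benchmark command above to the number of GPUs your training job actually uses.
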
When debugging NCCL communication timeouts, it can be useful to activate additional logging in both PyTorch and NCCL:

```shell
# NCCL-side logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# PyTorch-side logging and error capture
export TORCH_DISTRIBUTED_DEBUG=INFO
export TORCHELASTIC_ERROR_FILE=/PATH/TO/torcherror.log
```

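At INFO level the output can be very noisy on multi-GPU runs, so it may be worth redirecting NCCL's log to per-process files rather than stdout. A small sketch using NCCL's log-file variable, where `%h` expands to the hostname and `%p` to the process ID:

```shell
# Write NCCL debug output to one file per host and process,
# e.g. /tmp/nccl-debug.node01.12345.log, instead of interleaving it on stdout.
export NCCL_DEBUG_FILE=/tmp/nccl-debug.%h.%p.log
```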
Finally, if you believe your training job needs more time, you can increase the timeout past 30 minutes by setting the ``ddp_timeout`` value in the Axolotl configuration. See [PyTorch init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for documentation on this value.

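As a sketch, to raise the limit to two hours you could add the key to your training YAML; the config path below is only a placeholder, and the value is in seconds (mirroring the 1800-second default behind the 30-minute timeout above):

```shell
# Append the override to your Axolotl config (placeholder path);
# 7200 seconds = 2 hours, vs. the 1800-second (30-minute) default.
echo "ddp_timeout: 7200" >> examples/my-training-config.yml
```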