Here is the full benchmark code and outputs:
```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}
# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
```
Hardware: 2x TITAN RTX (24GB each) + NVLink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`)
Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`
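For reference, here is a quick back-of-the-envelope comparison of the two runs (a small Python sketch using the `train_runtime` values reported above): the NVLink run finishes the same 200 steps in roughly 23% less time, which corresponds to about 29% higher throughput.

```python
# Compare the two benchmark runs; runtimes are copied verbatim from the outputs above.
runtime_nvlink = 101.9003      # seconds, DDP w/ NVLink
runtime_no_nvlink = 131.4367   # seconds, DDP w/o NVLink (NCCL_P2P_DISABLE=1)

time_saved = 1 - runtime_nvlink / runtime_no_nvlink        # ~0.225
throughput_gain = runtime_no_nvlink / runtime_nvlink - 1   # ~0.29

print(f"NVLink run finishes in {time_saved:.1%} less time")
print(f"equivalent to a {throughput_gain:.0%} throughput increase")
```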