Batch Size
#4 · opened by winglian
The model card lists: Batch Size (effective): 32 (8B), 128 (70B), 256 (405B) — but are the 8B and 70B values reversed?
The 70B config seems like it should be 16; https://github.com/allenai/open-instruct/blob/main/configs/train_configs/tulu3/tulu3_dpo_70b.yaml lists:
per_device_train_batch_size: 1
gradient_accumulation_steps: 2 # designed for 8 GPUs, so batch size 128
whereas the 8B config should be 128; https://github.com/allenai/open-instruct/blob/main/configs/train_configs/tulu3/tulu3_dpo_8b.yaml lists:
per_device_train_batch_size: 1
gradient_accumulation_steps: 16 # designed for 8 GPUs, so batch size 128
Ah, good catch! Looking at the original runs for the 8B and 70B models, they should both be 128. The 70B yaml should say "designed for 8 nodes" (so the effective batch size is 8 × 8 × 2 = 128).
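For reference, a minimal sketch of the arithmetic being discussed (the helper function and the assumption that effective batch size = per_device_train_batch_size × gradient_accumulation_steps × GPUs per node × nodes are illustrative, not code from open-instruct):

```python
def effective_batch_size(per_device_bsz: int, grad_accum: int,
                         gpus_per_node: int, num_nodes: int = 1) -> int:
    """Effective (global) batch size under plain data parallelism."""
    return per_device_bsz * grad_accum * gpus_per_node * num_nodes

# 8B config: per_device 1, grad accum 16, 8 GPUs on a single node
print(effective_batch_size(1, 16, 8, 1))  # 128

# 70B config read as a single 8-GPU node: per_device 1, grad accum 2
print(effective_batch_size(1, 2, 8, 1))   # 16

# 70B config as clarified above, run across 8 nodes of 8 GPUs
print(effective_batch_size(1, 2, 8, 8))   # 128
```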
hamishivi changed discussion status to closed