++ torchrun --nproc_per_node=8 --master_port=29729 train.py --model_name_or_path /mnt/data/jiayi/Llama-2-7b-hf --data_path ./alpaca_data.json --bf16 True --output_dir /home/jiayi/alpaca-2 --num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy no --save_strategy steps --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --fsdp 'full_shard auto_wrap' --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer --tf32 True
[2024-01-10 07:41:38,265] torch.distributed.run: [WARNING]
[2024-01-10 07:41:38,265] torch.distributed.run: [WARNING] *****************************************
[2024-01-10 07:41:38,265] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-01-10 07:41:38,265] torch.distributed.run: [WARNING] *****************************************
/home/jiayi/.local/lib/python3.10/site-packages/transformers/training_args.py:1635: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
Loading checkpoint shards:   0%|          | 0/2 [00:00
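The FutureWarning above is emitted by transformers: passing the wrap class via --fsdp_transformer_layer_cls_to_wrap still works but is deprecated in favor of --fsdp_config, which points at a JSON file of FSDP settings. A minimal sketch of that replacement is shown below; the file name fsdp_config.json is hypothetical, and the exact JSON key name and value format (string vs. list) vary across transformers versions, so check the version you have installed.

# Sketch only: write the FSDP wrap setting to a JSON file (hypothetical file name)
cat > fsdp_config.json <<'EOF'
{
  "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]
}
EOF
# Then rerun the same torchrun command as above, keeping every other argument unchanged,
# but swap the deprecated flag for the config file:
#   --fsdp 'full_shard auto_wrap' --fsdp_config fsdp_config.json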