End of training
7a5c855 verified
[2025-05-10 11:15:09] Created output directory: train_results_pred_mask/google_gemma-3-1b-pt_ds100_upsample1000_predict_mask
[2025-05-10 11:15:09] Chat mode disabled
[2025-05-10 11:15:09] Model size is 3B or smaller (1B). Using full fine-tuning.
[2025-05-10 11:15:09] No QA format data will be used
[2025-05-10 11:15:09] Limiting dataset size to: 100 samples
[2025-05-10 11:15:09] =======================================
[2025-05-10 11:15:09] Starting training for model: google/gemma-3-1b-pt
[2025-05-10 11:15:09] =======================================
[2025-05-10 11:15:09] CUDA_VISIBLE_DEVICES: 0,1,2,3
[2025-05-10 11:15:09] WANDB_PROJECT: wikidyk-ar
[2025-05-10 11:15:09] DATA_PATH: data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-05-10 11:15:09] Global Batch Size: 128
[2025-05-10 11:15:09] Data Size: 100
[2025-05-10 11:15:09] Executing command: torchrun --nproc_per_node "4" --master-port 29501 src/train.py --model_name_or_path "google/gemma-3-1b-pt" --data_path "data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results_pred_mask/google_gemma-3-1b-pt_ds100_upsample1000_predict_mask" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-5" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy no --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "true" --ds_size 100
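The "Global Batch Size: 128" logged above follows directly from the torchrun flags in this command. A minimal sketch of the arithmetic (variable names are illustrative, mirroring the CLI flags):

```python
# How the logged "Global Batch Size: 128" is derived from the launch flags.
nproc_per_node = 4                 # --nproc_per_node "4": one worker per GPU
per_device_train_batch_size = 32   # --per_device_train_batch_size "32"
gradient_accumulation_steps = 1    # --gradient_accumulation_steps "1"

global_batch_size = (
    nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps
)
print(global_batch_size)  # 128
```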
[2025-05-10 11:15:09] Training started at Sat May 10 11:15:09 UTC 2025
W0510 11:15:11.296000 361019 site-packages/torch/distributed/run.py:792]
W0510 11:15:11.296000 361019 site-packages/torch/distributed/run.py:792] *****************************************
W0510 11:15:11.296000 361019 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0510 11:15:11.296000 361019 site-packages/torch/distributed/run.py:792] *****************************************
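The warning above notes that torchrun sets `OMP_NUM_THREADS=1` per worker by default. One way to tune it, as the warning suggests, is to export the variable explicitly before launching; a small sketch of that pattern, verifying the value propagates to a child process (the value `4` is illustrative, not a recommendation):

```python
import os
import subprocess
import sys

# Pin OMP_NUM_THREADS explicitly instead of relying on torchrun's default of 1.
env = dict(os.environ, OMP_NUM_THREADS="4")  # illustrative thread count

# Confirm the child process (standing in for a torchrun worker) sees the value.
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    env=env, capture_output=True, text=True,
).stdout.strip()
print(out)  # 4
```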
Traceback (most recent call last):
File "/root/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
result = agent.run()
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
result = f(*args, **kwargs)
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
result = self._invoke_run(role)
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
self._initialize_workers(self._worker_group)
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
result = f(*args, **kwargs)
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
self._rendezvous(worker_group)
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
result = f(*args, **kwargs)
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
rdzv_info = spec.rdzv_handler.next_rendezvous()
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 67, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
RuntimeError: The server socket has failed to listen on any local network address. port: 29501, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
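The `EADDRINUSE` failure above means another process (likely a previous run) is still bound to port 29501, so the static rendezvous cannot listen. One workaround is to let the OS pick an unused port and pass it as `--master-port`. A minimal sketch, where `find_free_port` is a hypothetical helper, not part of torchrun:

```python
import socket

def find_free_port() -> int:
    """Ask the kernel for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))            # port 0 lets the OS choose a free port
        return s.getsockname()[1]  # the port the OS actually assigned

port = find_free_port()
print(port)  # e.g. pass this value as: torchrun --master-port <port> ...
```

Alternatively, `lsof -i :29501` (or `ss -tlnp | grep 29501`) identifies the process still holding the port so it can be terminated before relaunching.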
[2025-05-10 11:15:11] ERROR: Training failed for google/gemma-3-1b-pt with exit code 1
[2025-05-10 11:15:11] Check error log for details: train_results_pred_mask/google_gemma-3-1b-pt_ds100_upsample1000_predict_mask/20250510_111509.log
[2025-05-10 11:15:11] Resource usage after training google/gemma-3-1b-pt:
[2025-05-10 11:15:11] GPU memory usage (used, total per GPU):
3635 MiB, 40960 MiB
3615 MiB, 40960 MiB
3619 MiB, 40960 MiB
3611 MiB, 40960 MiB
[2025-05-10 11:15:11] Disk space usage for model outputs:
8.0K train_results_pred_mask/google_gemma-3-1b-pt_ds100_upsample1000_predict_mask
[2025-05-10 11:15:11]