|
[2025-05-10 11:15:09] Created output directory: train_results_pred_mask/google_gemma-3-1b-pt_ds100_upsample1000_predict_mask |
|
[2025-05-10 11:15:09] Chat mode disabled |
|
[2025-05-10 11:15:09] Model size is 3B or smaller (1B). Using full fine-tuning.
|
[2025-05-10 11:15:09] No QA format data will be used |
|
[2025-05-10 11:15:09] Limiting dataset size to: 100 samples |
|
[2025-05-10 11:15:09] ======================================= |
|
[2025-05-10 11:15:09] Starting training for model: google/gemma-3-1b-pt |
|
[2025-05-10 11:15:09] ======================================= |
|
[2025-05-10 11:15:09] CUDA_VISIBLE_DEVICES: 0,1,2,3 |
|
[2025-05-10 11:15:09] WANDB_PROJECT: wikidyk-ar |
|
[2025-05-10 11:15:09] DATA_PATH: data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json |
|
[2025-05-10 11:15:09] Global Batch Size: 128 |
|
[2025-05-10 11:15:09] Data Size: 100 |
|
[2025-05-10 11:15:09] Executing command: torchrun --nproc_per_node "4" --master-port 29501 src/train.py --model_name_or_path "google/gemma-3-1b-pt" --data_path "data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results_pred_mask/google_gemma-3-1b-pt_ds100_upsample1000_predict_mask" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-5" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy no --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "true" --ds_size 100 |
|
[2025-05-10 11:15:09] Training started at Sat May 10 11:15:09 UTC 2025 |
|
W0510 11:15:11.296000 361019 site-packages/torch/distributed/run.py:792] |
|
W0510 11:15:11.296000 361019 site-packages/torch/distributed/run.py:792] ***************************************** |
|
W0510 11:15:11.296000 361019 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0510 11:15:11.296000 361019 site-packages/torch/distributed/run.py:792] ***************************************** |
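
Note: the warning above is torchrun's default behavior; it exports OMP_NUM_THREADS=1 for every worker unless the variable is already set. A minimal sketch of setting the thread count explicitly before launch (the value 8 is illustrative, not taken from this run):

    # Assumption: 8 threads per process is an illustrative choice, not tuned for this machine.
    export OMP_NUM_THREADS=8
    torchrun --nproc_per_node 4 --master-port 29501 src/train.py ...   # remaining arguments as logged above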
|
Traceback (most recent call last): |
|
File "/root/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper |
|
return f(*args, **kwargs) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main |
|
run(args) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run |
|
elastic_launch( |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent |
|
result = agent.run() |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper |
|
result = f(*args, **kwargs) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run |
|
result = self._invoke_run(role) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run |
|
self._initialize_workers(self._worker_group) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper |
|
result = f(*args, **kwargs) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers |
|
self._rendezvous(worker_group) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper |
|
result = f(*args, **kwargs) |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous |
|
rdzv_info = spec.rdzv_handler.next_rendezvous() |
|
File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 67, in next_rendezvous |
|
self._store = TCPStore( # type: ignore[call-arg] |
|
RuntimeError: The server socket has failed to listen on any local network address. port: 29501, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use |
|
[2025-05-10 11:15:11] ERROR: Training failed for google/gemma-3-1b-pt with exit code 1
|
[2025-05-10 11:15:11] Check error log for details: train_results_pred_mask/google_gemma-3-1b-pt_ds100_upsample1000_predict_mask/20250510_111509.log |
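
Note: the root cause is the TCPStore rendezvous failing with EADDRINUSE, i.e. another process is already listening on master port 29501 (commonly a previous torchrun that has not exited). A minimal diagnostic sketch; the port number comes from the command logged above, everything else is a generic example:

    # Check what is holding the rendezvous port (requires iproute2's ss).
    ss -ltnp | grep 29501
    # Either terminate the stale process, or relaunch on an unused port, e.g.:
    torchrun --nproc_per_node 4 --master-port 29502 src/train.py ...   # remaining arguments as logged above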
|
[2025-05-10 11:15:11] Resource usage after training google/gemma-3-1b-pt: |
|
[2025-05-10 11:15:11] GPU memory usage (used, total per GPU):
|
3635 MiB, 40960 MiB |
|
3615 MiB, 40960 MiB |
|
3619 MiB, 40960 MiB |
|
3611 MiB, 40960 MiB |
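
Note: the two columns above read as used and total memory for each of the four visible GPUs (40960 MiB total per device). As an assumption, figures in this shape are typically produced by an nvidia-smi query such as:

    # Assumption: illustrative query; the actual script may gather these numbers differently.
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader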
|
[2025-05-10 11:15:11] Disk space usage for model outputs: |
|
8.0K train_results_pred_mask/google_gemma-3-1b-pt_ds100_upsample1000_predict_mask |
|
[2025-05-10 11:15:11] |
|
|