[2025-02-01 18:47:39,616][oumi][rank0][pid:11750][MainThread][INFO][train.py:144] Resolved 'training.dataloader_num_workers=auto' to 'training.dataloader_num_workers=8'
[2025-02-01 18:47:39,618][oumi][rank0][pid:11750][MainThread][INFO][train.py:174] TrainingConfig: TrainingConfig(data=DataParams(train=DatasetSplitParams(datasets=[DatasetParams(dataset_name='text_sft_jsonl', dataset_path='data/R1/math_10k_R1_outputs.jsonl', subset=None, split='train', dataset_kwargs={}, sample_count=None, mixture_proportion=None, shuffle=False, seed=None, shuffle_buffer_size=1000, trust_remote_code=False, transform_num_workers=None)], collator_name=None, pack=False, stream=False, target_col=None, mixture_strategy='first_exhausted', seed=42, use_async_dataset=False, use_torchdata=None), test=DatasetSplitParams(datasets=[], collator_name=None, pack=False, stream=False, target_col=None, mixture_strategy='first_exhausted', seed=None, use_async_dataset=False, use_torchdata=None), validation=DatasetSplitParams(datasets=[], collator_name=None, pack=False, stream=False, target_col=None, mixture_strategy='first_exhausted', seed=None, use_async_dataset=False, use_torchdata=None)), model=ModelParams(model_name='HuggingFaceTB/SmolLM2-1.7B-Instruct', adapter_model=None, tokenizer_name=None, tokenizer_pad_token=None, tokenizer_kwargs={}, model_max_length=None, load_pretrained_weights=True, trust_remote_code=True, torch_dtype_str='bfloat16', compile=False, chat_template=None, attn_implementation=None, device_map='auto', model_kwargs={}, enable_liger_kernel=False, shard_for_eval=False, freeze_layers=[]), training=TrainingParams(use_peft=False, trainer_type=, enable_gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, output_dir='output/smollm2-17b-distill-r1-670b-math', per_device_train_batch_size=2, per_device_eval_batch_size=8, gradient_accumulation_steps=2, max_steps=-1, num_train_epochs=1, save_epoch=False, save_steps=0, save_final_model=True, seed=42, 
run_name='smollm2-17b-distill-r1-670b-math.sky-2025-02-01-13-42-43-696171_sky-d954-bf996_1', metrics_function=None, log_level='info', dep_log_level='warning', enable_wandb=True, enable_tensorboard=True, logging_strategy='steps', logging_dir=None, logging_steps=10, logging_first_step=False, eval_strategy='no', eval_steps=500, learning_rate=2e-05, lr_scheduler_type='linear', lr_scheduler_kwargs={}, warmup_ratio=0.1, warmup_steps=None, optimizer='adamw_torch_fused', weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, sgd_momentum=0.0, mixed_precision_dtype=, compile=False, include_performance_metrics=False, include_alternative_mfu_metrics=False, log_model_summary=False, resume_from_checkpoint=None, try_resume_from_last_checkpoint=False, dataloader_num_workers=8, dataloader_prefetch_factor=32, dataloader_main_process_only=None, ddp_find_unused_parameters=False, max_grad_norm=10.0, trainer_kwargs={}, profiler=ProfilerParams(save_dir=None, enable_cpu_profiling=False, enable_cuda_profiling=False, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, row_limit=50, schedule=ProfilerScheduleParams(enable_schedule=False, wait=0, warmup=1, active=3, repeat=1, skip_first=1)), telemetry=TelemetryParams(telemetry_dir='telemetry', collect_telemetry_for_all_ranks=False, track_gpu_temperature=False), empty_device_cache_steps=1, nccl_default_timeout_minutes=None), peft=PeftParams(lora_r=8, lora_alpha=8, lora_dropout=0.0, lora_target_modules=None, lora_modules_to_save=None, lora_bias='none', lora_init_weights=, lora_task_type=, q_lora=False, q_lora_bits=4, bnb_4bit_quant_type='fp4', use_bnb_nested_quant=False, bnb_4bit_quant_storage='uint8', bnb_4bit_compute_dtype='float32', peft_save_mode=), fsdp=FSDPParams(enable_fsdp=False, sharding_strategy=, cpu_offload=False, mixed_precision=None, backward_prefetch=, forward_prefetch=False, use_orig_params=None, state_dict_type=, auto_wrap_policy=, min_num_params=100000, 
transformer_layer_cls=None, sync_module_states=True))
[2025-02-01 18:47:39,903][oumi][rank0][pid:11750][MainThread][INFO][models.py:180] Building model for distributed training (world_size: 4)...
[2025-02-01 18:47:39,903][oumi][rank0][pid:11750][MainThread][INFO][models.py:185] Building model using device_map: cuda:0 (DeviceRankInfo(world_size=4, rank=0, local_world_size=4, local_rank=0))...
[2025-02-01 18:47:39,903][oumi][rank0][pid:11750][MainThread][INFO][models.py:255] Using model class: to instantiate model.
[2025-02-01 18:47:41,904][oumi][rank0][pid:11750][MainThread][INFO][base_map_dataset.py:68] Creating map dataset (type: TextSftJsonLinesDataset) dataset_name: 'text_sft_jsonl', dataset_path: 'None'...
[2025-02-01 18:47:41,946][oumi][rank0][pid:11750][MainThread][INFO][base_map_dataset.py:297] TextSftJsonLinesDataset: features=dict_keys(['input_ids', 'attention_mask'])
[2025-02-01 18:47:47,678][oumi][rank0][pid:11750][MainThread][INFO][base_map_dataset.py:361] Finished transforming dataset (TextSftJsonLinesDataset)! Speed: 1744.52 examples/sec. Examples: 10000. Duration: 5.7 sec. Transform workers: 1.
[2025-02-01 18:47:47,943][oumi][rank0][pid:11750][MainThread][INFO][torch_profiler_utils.py:150] PROF: Torch Profiler disabled!
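As a sanity check on the configuration logged above: with `per_device_train_batch_size=2`, `gradient_accumulation_steps=2`, and `world_size: 4`, the effective global batch size and per-epoch optimizer step count over the 10000-example dataset can be derived with a few lines of arithmetic. This is a minimal sketch using values copied from the log; the warmup rounding shown is our assumption, not necessarily the exact rule the TRL trainer applies.

```python
import math

# Values taken from the TrainingConfig and log lines above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
world_size = 4          # "distributed training (world_size: 4)"
num_examples = 10_000   # "Examples: 10000"
warmup_ratio = 0.1

# Examples consumed per optimizer step across all ranks.
global_batch_size = (per_device_train_batch_size
                     * gradient_accumulation_steps
                     * world_size)

# Optimizer steps in one epoch (max_steps=-1, num_train_epochs=1).
steps_per_epoch = math.ceil(num_examples / global_batch_size)

# Approximate warmup length implied by warmup_ratio (rounding is an assumption).
warmup_steps = math.ceil(warmup_ratio * steps_per_epoch)

print(global_batch_size, steps_per_epoch, warmup_steps)  # 16 625 63
```

So each of the ~625 optimizer steps sees 16 examples, which is consistent with the roughly five-minute wall-clock time for one epoch reported at the end of the log.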
[2025-02-01 18:47:47,998][oumi][rank0][pid:11750][MainThread][INFO][training.py:49] SFTConfig(output_dir='output/smollm2-17b-distill-r1-670b-math', overwrite_output_dir=False, do_train=False, do_eval=False, do_predict=False, eval_strategy=, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=2, eval_accumulation_steps=None, eval_delay=0, torch_empty_cache_steps=1, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=10.0, num_train_epochs=1, max_steps=-1, lr_scheduler_type=, lr_scheduler_kwargs={}, warmup_ratio=0.1, warmup_steps=0, log_level='warning', log_level_replica='warning', log_on_each_node=True, logging_dir='output/smollm2-17b-distill-r1-670b-math/runs/Feb01_18-47-47_sky-d954-bf996-370b-head-3sxnf23v-compute', logging_strategy=, logging_first_step=False, logging_steps=10, logging_nan_inf_filter=True, save_strategy=, save_steps=0, save_total_limit=None, save_safetensors=True, save_on_each_node=False, save_only_model=False, restore_callback_states_from_checkpoint=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=False, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=None, local_rank=0, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=8, dataloader_prefetch_factor=32, past_index=-1, run_name='smollm2-17b-distill-r1-670b-math.sky-2025-02-01-13-42-43-696171_sky-d954-bf996_1', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False, 
'xla_fsdp_v2': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None, use_configured_state=False), deepspeed=None, label_smoothing_factor=0.0, optim=, optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['wandb', 'tensorboard'], ddp_find_unused_parameters=False, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=True, gradient_checkpointing_kwargs={'use_reentrant': False}, include_inputs_for_metrics=False, eval_do_concat_batches=True, fp16_backend='auto', evaluation_strategy=None, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, include_num_input_tokens_seen=False, neftune_noise_alpha=None, optim_target_modules=None, batch_eval_metrics=False, eval_on_start=False, use_liger_kernel=False, eval_use_gather_object=False, dataset_text_field=None, packing=False, max_seq_length=None, dataset_num_proc=None, dataset_batch_size=1000, model_init_kwargs=None, dataset_kwargs=None, eval_packing=None, num_of_sequences=1024, chars_per_token=3.6, use_liger=False)
[2025-02-01 18:47:48,072][oumi][rank0][pid:11750][MainThread][INFO][device_utils.py:283] GPU Metrics Before Training: GPU runtime info: NVidiaGpuRuntimeInfo(device_index=0, device_count=4, used_memory_mb=7019.0, 
temperature=33, fan_speed=None, fan_speeds=None, power_usage_watts=70.637, power_limit_watts=400.0, gpu_utilization=0, memory_utilization=0, performance_state=0, clock_speed_graphics=1155, clock_speed_sm=1155, clock_speed_memory=1593).
[2025-02-01 18:47:48,078][oumi][rank0][pid:11750][MainThread][INFO][train.py:312] Training init time: 10.796s
[2025-02-01 18:47:48,078][oumi][rank0][pid:11750][MainThread][INFO][train.py:313] Starting training... (TrainerType.TRL_SFT, transformers: 4.45.2)
[2025-02-01 18:52:35,471][oumi][rank0][pid:11750][MainThread][INFO][train.py:320] Training is Complete.
[2025-02-01 18:52:35,501][oumi][rank0][pid:11750][MainThread][INFO][device_utils.py:283] GPU Metrics After Training: GPU runtime info: NVidiaGpuRuntimeInfo(device_index=0, device_count=4, used_memory_mb=21283.0, temperature=43, fan_speed=None, fan_speeds=None, power_usage_watts=181.852, power_limit_watts=400.0, gpu_utilization=54, memory_utilization=14, performance_state=0, clock_speed_graphics=1410, clock_speed_sm=1410, clock_speed_memory=1593).
[2025-02-01 18:52:35,501][oumi][rank0][pid:11750][MainThread][INFO][torch_utils.py:117] Peak GPU memory usage: 17.43 GB
[2025-02-01 18:52:35,501][oumi][rank0][pid:11750][MainThread][INFO][train.py:327] Saving final state...
[2025-02-01 18:52:35,504][oumi][rank0][pid:11750][MainThread][INFO][train.py:332] Saving final model...
[2025-02-01 18:52:43,074][oumi][rank0][pid:11750][MainThread][INFO][hf_trainer.py:102] Model has been saved at output/smollm2-17b-distill-r1-670b-math
[2025-02-01 18:52:43,650][oumi][rank0][pid:11750][MainThread][INFO][train.py:339] » We're always looking for feedback. What's one thing we can improve? https://oumi.ai/feedback
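The log reports init time (10.796s) but not the training wall-clock time; it can be recovered from the "Starting training..." and "Training is Complete." timestamps. A small stdlib sketch (the timestamps are copied verbatim from the log; note that `%f` pads the 3-digit millisecond field to microseconds):

```python
from datetime import datetime

# Timestamps copied from the two log records above.
FMT = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2025-02-01 18:47:48,078", FMT)  # "Starting training..."
end = datetime.strptime("2025-02-01 18:52:35,471", FMT)    # "Training is Complete."

elapsed = (end - start).total_seconds()
print(f"{elapsed:.3f}s")  # 287.393s, i.e. about 4 min 47 s for one epoch
```

That works out to roughly 10000 examples / 287.4s ≈ 35 examples/sec across the 4 GPUs, which fits the modest 54% GPU utilization and ~17.4 GB peak memory reported after training.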