[INFO|2025-04-22 23:31:16] tokenization_utils_base.py:2060 >> loading file tokenizer.model from cache at None
[INFO|2025-04-22 23:31:16] tokenization_utils_base.py:2060 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/tokenizer.json
[INFO|2025-04-22 23:31:16] tokenization_utils_base.py:2060 >> loading file added_tokens.json from cache at None
[INFO|2025-04-22 23:31:16] tokenization_utils_base.py:2060 >> loading file special_tokens_map.json from cache at None
[INFO|2025-04-22 23:31:16] tokenization_utils_base.py:2060 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/tokenizer_config.json
[INFO|2025-04-22 23:31:16] tokenization_utils_base.py:2060 >> loading file chat_template.jinja from cache at None
[INFO|2025-04-22 23:31:17] tokenization_utils_base.py:2323 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-04-22 23:31:17] configuration_utils.py:693 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/config.json
[INFO|2025-04-22 23:31:17] configuration_utils.py:765 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 30, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.3", "use_cache": true, "vocab_size": 102400 }
[INFO|2025-04-22 23:31:17] tokenization_utils_base.py:2060 >> loading file tokenizer.model from cache at None
[INFO|2025-04-22 23:31:17] tokenization_utils_base.py:2060 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/tokenizer.json
[INFO|2025-04-22 23:31:17] tokenization_utils_base.py:2060 >> loading file added_tokens.json from cache at None
[INFO|2025-04-22 23:31:17] tokenization_utils_base.py:2060 >> loading file special_tokens_map.json from cache at None
[INFO|2025-04-22 23:31:17] tokenization_utils_base.py:2060 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/tokenizer_config.json
[INFO|2025-04-22 23:31:17] tokenization_utils_base.py:2060 >> loading file chat_template.jinja from cache at None
[INFO|2025-04-22 23:31:17] tokenization_utils_base.py:2323 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-04-22 23:31:18] logging.py:143 >> Loading dataset alpaca_en_demo.json...
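The lines above are emitted while LLaMA-Factory resolves the DeepSeek tokenizer from the Hugging Face cache and loads the Alpaca-style demo dataset. A minimal standalone sketch of the equivalent steps (assuming the transformers and datasets libraries and a local alpaca_en_demo.json; the instruction/input/output keys and the max_length cutoff are assumptions about the Alpaca format, and the prompt template here is purely illustrative — the real run uses LLaMA-Factory's own configurable template):

    from transformers import AutoTokenizer
    from datasets import load_dataset

    # Resolve the tokenizer from the same Hub repo the log shows being cached.
    tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")

    # Alpaca-style JSON: a list of {"instruction", "input", "output"} records.
    dataset = load_dataset("json", data_files="alpaca_en_demo.json", split="train")

    def format_example(example):
        # Hypothetical prompt layout for illustration only.
        prompt = example["instruction"]
        if example.get("input"):
            prompt += "\n" + example["input"]
        return tokenizer(prompt + "\n" + example["output"], truncation=True, max_length=1024)

    tokenized = dataset.map(format_example, remove_columns=dataset.column_names)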
[INFO|2025-04-22 23:31:28] configuration_utils.py:693 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/config.json
[INFO|2025-04-22 23:31:28] configuration_utils.py:765 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 30, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.3", "use_cache": true, "vocab_size": 102400 }
[INFO|2025-04-22 23:31:28] logging.py:143 >> Quantizing model to 4 bit with bitsandbytes.
[INFO|2025-04-22 23:31:28] logging.py:143 >> KV cache is disabled during training.
[INFO|2025-04-22 23:31:29] modeling_utils.py:1124 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/pytorch_model.bin.index.json
[INFO|2025-04-22 23:31:30] safetensors_conversion.py:61 >> Attempting to create safetensors variant
[INFO|2025-04-22 23:31:30] safetensors_conversion.py:74 >> Safetensors PR exists
[INFO|2025-04-22 23:33:11] modeling_utils.py:2167 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|2025-04-22 23:33:11] configuration_utils.py:1142 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "use_cache": false }
[INFO|2025-04-22 23:34:25] modeling_utils.py:4930 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|2025-04-22 23:34:25] modeling_utils.py:4938 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at deepseek-ai/deepseek-llm-7b-base. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|2025-04-22 23:34:25] configuration_utils.py:1097 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/generation_config.json
[INFO|2025-04-22 23:34:25] configuration_utils.py:1142 >> Generate config GenerationConfig { "bos_token_id": 100000, "eos_token_id": 100001 }
[INFO|2025-04-22 23:34:25] logging.py:143 >> Gradient checkpointing enabled.
[INFO|2025-04-22 23:34:25] logging.py:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-04-22 23:34:25] logging.py:143 >> Upcasting trainable params to float32.
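This block corresponds to loading the base model in 4-bit with bitsandbytes, disabling the KV cache for training, enabling gradient checkpointing, and selecting torch SDPA attention. A rough standalone equivalent (a sketch, assuming bitsandbytes and accelerate are installed; the NF4 and double-quantization settings plus device_map are assumptions, since the log only says "Quantizing model to 4 bit with bitsandbytes"):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit quantization as reported by "Quantizing model to 4 bit with bitsandbytes".
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # matches the bfloat16 torch_dtype in the config dump
        bnb_4bit_quant_type="nf4",              # assumed, not stated in the log
        bnb_4bit_use_double_quant=True,         # assumed, not stated in the log
    )

    model = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/deepseek-llm-7b-base",
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        attn_implementation="sdpa",  # "Using torch SDPA for faster training and inference."
        device_map="auto",           # assumed placement strategy
    )

    model.config.use_cache = False          # "KV cache is disabled during training."
    model.gradient_checkpointing_enable()   # "Gradient checkpointing enabled."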
[INFO|2025-04-22 23:34:25] logging.py:143 >> Fine-tuning method: LoRA
[INFO|2025-04-22 23:34:25] logging.py:143 >> Found linear modules: o_proj,q_proj,gate_proj,k_proj,down_proj,v_proj,up_proj
[INFO|2025-04-22 23:34:26] logging.py:143 >> trainable params: 18,739,200 || all params: 6,929,104,896 || trainable%: 0.2704
[INFO|2025-04-22 23:34:26] trainer.py:748 >> Using auto half precision backend
[INFO|2025-04-22 23:34:26] trainer.py:2414 >> ***** Running training *****
[INFO|2025-04-22 23:34:26] trainer.py:2415 >> Num examples = 1,307
[INFO|2025-04-22 23:34:26] trainer.py:2416 >> Num Epochs = 3
[INFO|2025-04-22 23:34:26] trainer.py:2417 >> Instantaneous batch size per device = 2
[INFO|2025-04-22 23:34:26] trainer.py:2420 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|2025-04-22 23:34:26] trainer.py:2421 >> Gradient Accumulation steps = 8
[INFO|2025-04-22 23:34:26] trainer.py:2422 >> Total optimization steps = 243
[INFO|2025-04-22 23:34:26] trainer.py:2423 >> Number of trainable parameters = 18,739,200
[INFO|2025-04-22 23:36:16] logging.py:143 >> {'loss': 2.3883, 'learning_rate': 1.9987e-04, 'epoch': 0.06, 'throughput': 36.53}
[INFO|2025-04-22 23:38:05] logging.py:143 >> {'loss': 1.4752, 'learning_rate': 1.9932e-04, 'epoch': 0.12, 'throughput': 36.29}
[INFO|2025-04-22 23:39:57] logging.py:143 >> {'loss': 1.1174, 'learning_rate': 1.9837e-04, 'epoch': 0.18, 'throughput': 36.12}
[INFO|2025-04-22 23:41:45] logging.py:143 >> {'loss': 0.7815, 'learning_rate': 1.9700e-04, 'epoch': 0.24, 'throughput': 36.03}
[INFO|2025-04-22 23:43:35] logging.py:143 >> {'loss': 0.6320, 'learning_rate': 1.9522e-04, 'epoch': 0.31, 'throughput': 36.22}
[INFO|2025-04-22 23:45:25] logging.py:143 >> {'loss': 0.4603, 'learning_rate': 1.9305e-04, 'epoch': 0.37, 'throughput': 36.20}
[INFO|2025-04-22 23:47:17] logging.py:143 >> {'loss': 0.4761, 'learning_rate': 1.9049e-04, 'epoch': 0.43, 'throughput': 36.25}
[INFO|2025-04-22 23:49:06] logging.py:143 >> {'loss': 0.4525, 'learning_rate': 1.8756e-04, 'epoch': 0.49, 'throughput': 36.22}
[INFO|2025-04-22 23:50:55] logging.py:143 >> {'loss': 0.4803, 'learning_rate': 1.8425e-04, 'epoch': 0.55, 'throughput': 36.12}
[INFO|2025-04-22 23:52:45] logging.py:143 >> {'loss': 0.3933, 'learning_rate': 1.8060e-04, 'epoch': 0.61, 'throughput': 36.11}
[INFO|2025-04-22 23:54:39] logging.py:143 >> {'loss': 0.3919, 'learning_rate': 1.7660e-04, 'epoch': 0.67, 'throughput': 36.22}
[INFO|2025-04-22 23:56:33] logging.py:143 >> {'loss': 0.3804, 'learning_rate': 1.7229e-04, 'epoch': 0.73, 'throughput': 36.31}
[INFO|2025-04-22 23:58:22] logging.py:143 >> {'loss': 0.3576, 'learning_rate': 1.6768e-04, 'epoch': 0.80, 'throughput': 36.28}
[INFO|2025-04-23 00:00:11] logging.py:143 >> {'loss': 0.3343, 'learning_rate': 1.6278e-04, 'epoch': 0.86, 'throughput': 36.32}
[INFO|2025-04-23 00:01:59] logging.py:143 >> {'loss': 0.4526, 'learning_rate': 1.5762e-04, 'epoch': 0.92, 'throughput': 36.27}
[INFO|2025-04-23 00:03:48] logging.py:143 >> {'loss': 0.4048, 'learning_rate': 1.5222e-04, 'epoch': 0.98, 'throughput': 36.24}
[INFO|2025-04-23 00:05:46] logging.py:143 >> {'loss': 0.3599, 'learning_rate': 1.4660e-04, 'epoch': 1.05, 'throughput': 36.26}
[INFO|2025-04-23 00:07:38] logging.py:143 >> {'loss': 0.2197, 'learning_rate': 1.4079e-04, 'epoch': 1.11, 'throughput': 36.31}
[INFO|2025-04-23 00:09:28] logging.py:143 >> {'loss': 0.1816, 'learning_rate': 1.3481e-04, 'epoch': 1.17, 'throughput': 36.33}
[INFO|2025-04-23 00:11:19] logging.py:143 >> {'loss': 0.1839, 'learning_rate': 1.2868e-04, 'epoch': 1.23, 'throughput': 36.33}
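The run banner above reports LoRA over all seven linear projections with 18,739,200 trainable parameters out of ~6.93B (0.27%), and an effective schedule of batch size 2 x gradient accumulation 8 = 16, 3 epochs, 243 optimization steps. The parameter count is consistent with LoRA rank 8: per layer, 8 x [4x(4096+4096) + 2x(4096+11008) + (11008+4096)] = 624,640, and 30 layers x 624,640 = 18,739,200. A PEFT/Trainer configuration consistent with these numbers (a sketch; rank 8 is inferred from the count above, while lora_alpha=16 and lora_dropout=0 are assumptions not shown in the log):

    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import TrainingArguments

    # Upcasts norms/embeddings and trainable params to float32 for stable k-bit training,
    # matching "Upcasting trainable params to float32."
    model = prepare_model_for_kbit_training(model)  # `model` from the 4-bit loading sketch above

    lora_config = LoraConfig(
        task_type="CAUSAL_LM",
        r=8,                # inferred from the 18,739,200 trainable-parameter count
        lora_alpha=16,      # assumed
        lora_dropout=0.0,   # assumed
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # should report ~18,739,200 trainable params

    # Schedule matching the trainer banner and the logged learning-rate curve.
    training_args = TrainingArguments(
        output_dir="saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",  # assumed; the logged lr values follow a cosine decay
        bf16=True,
        logging_steps=5,             # matches the ~0.06-epoch logging interval
        save_steps=100,              # matches checkpoint-100 and checkpoint-200 below
    )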
[INFO|2025-04-23 00:11:19] trainer.py:3984 >> Saving model checkpoint to saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-100
[INFO|2025-04-23 00:11:19] configuration_utils.py:693 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/config.json
[INFO|2025-04-23 00:11:19] configuration_utils.py:765 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 30, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.3", "use_cache": true, "vocab_size": 102400 }
[INFO|2025-04-23 00:11:19] tokenization_utils_base.py:2510 >> tokenizer config file saved in saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-100/tokenizer_config.json
[INFO|2025-04-23 00:11:19] tokenization_utils_base.py:2519 >> Special tokens file saved in saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-100/special_tokens_map.json
[INFO|2025-04-23 00:13:14] logging.py:143 >> {'loss': 0.2100, 'learning_rate': 1.2243e-04, 'epoch': 1.29, 'throughput': 36.25}
[INFO|2025-04-23 00:15:04] logging.py:143 >> {'loss': 0.2568, 'learning_rate': 1.1609e-04, 'epoch': 1.35, 'throughput': 36.24}
[INFO|2025-04-23 00:16:56] logging.py:143 >> {'loss': 0.2049, 'learning_rate': 1.0968e-04, 'epoch': 1.42, 'throughput': 36.27}
[INFO|2025-04-23 00:18:44] logging.py:143 >> {'loss': 0.2213, 'learning_rate': 1.0323e-04, 'epoch': 1.48, 'throughput': 36.28}
[INFO|2025-04-23 00:20:34] logging.py:143 >> {'loss': 0.2836, 'learning_rate': 9.6768e-05, 'epoch': 1.54, 'throughput': 36.28}
[INFO|2025-04-23 00:22:25] logging.py:143 >> {'loss': 0.2280, 'learning_rate': 9.0319e-05, 'epoch': 1.60, 'throughput': 36.28}
[INFO|2025-04-23 00:24:15] logging.py:143 >> {'loss': 0.1778, 'learning_rate': 8.3910e-05, 'epoch': 1.66, 'throughput': 36.28}
[INFO|2025-04-23 00:26:03] logging.py:143 >> {'loss': 0.1871, 'learning_rate': 7.7568e-05, 'epoch': 1.72, 'throughput': 36.27}
[INFO|2025-04-23 00:27:52] logging.py:143 >> {'loss': 0.2289, 'learning_rate': 7.1320e-05, 'epoch': 1.78, 'throughput': 36.27}
[INFO|2025-04-23 00:29:41] logging.py:143 >> {'loss': 0.2889, 'learning_rate': 6.5191e-05, 'epoch': 1.84, 'throughput': 36.27}
[INFO|2025-04-23 00:31:32] logging.py:143 >> {'loss': 0.2724, 'learning_rate': 5.9208e-05, 'epoch': 1.91, 'throughput': 36.25}
[INFO|2025-04-23 00:33:23] logging.py:143 >> {'loss': 0.2359, 'learning_rate': 5.3396e-05, 'epoch': 1.97, 'throughput': 36.26}
[INFO|2025-04-23 00:35:23] logging.py:143 >> {'loss': 0.1864, 'learning_rate': 4.7778e-05, 'epoch': 2.04, 'throughput': 36.26}
[INFO|2025-04-23 00:37:13] logging.py:143 >> {'loss': 0.0671, 'learning_rate': 4.2378e-05, 'epoch': 2.10, 'throughput': 36.26}
[INFO|2025-04-23 00:39:02] logging.py:143 >> {'loss': 0.1465, 'learning_rate': 3.7219e-05, 'epoch': 2.16, 'throughput': 36.23}
[INFO|2025-04-23 00:40:51] logging.py:143 >> {'loss': 0.0910, 'learning_rate': 3.2322e-05, 'epoch': 2.22, 'throughput': 36.22}
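Intermediate checkpoints such as checkpoint-100 above hold the LoRA adapter weights plus the tokenizer files, so an interrupted run can be resumed from them. A minimal sketch of resuming with the plain Hugging Face Trainer (an assumption for illustration; the actual run drives training through LLaMA-Factory, which exposes the same resume behavior via its own options, and `model`, `training_args`, `tokenizer`, and `tokenized` come from the earlier sketches):

    from transformers import Trainer, DataCollatorForLanguageModeling

    trainer = Trainer(
        model=model,                 # PEFT-wrapped 4-bit model from the sketches above
        args=training_args,
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    )

    # Resume from the checkpoint the log shows being written at step 100.
    trainer.train(
        resume_from_checkpoint="saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-100"
    )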
[INFO|2025-04-23 00:42:41] logging.py:143 >> {'loss': 0.1140, 'learning_rate': 2.7708e-05, 'epoch': 2.28, 'throughput': 36.21}
[INFO|2025-04-23 00:44:33] logging.py:143 >> {'loss': 0.1381, 'learning_rate': 2.3396e-05, 'epoch': 2.34, 'throughput': 36.23}
[INFO|2025-04-23 00:46:22] logging.py:143 >> {'loss': 0.0799, 'learning_rate': 1.9403e-05, 'epoch': 2.40, 'throughput': 36.24}
[INFO|2025-04-23 00:48:14] logging.py:143 >> {'loss': 0.0722, 'learning_rate': 1.5748e-05, 'epoch': 2.46, 'throughput': 36.25}
[INFO|2025-04-23 00:48:14] trainer.py:3984 >> Saving model checkpoint to saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-200
[INFO|2025-04-23 00:48:14] configuration_utils.py:693 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/config.json
[INFO|2025-04-23 00:48:14] configuration_utils.py:765 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 30, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.3", "use_cache": true, "vocab_size": 102400 }
[INFO|2025-04-23 00:48:15] tokenization_utils_base.py:2510 >> tokenizer config file saved in saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-200/tokenizer_config.json
[INFO|2025-04-23 00:48:15] tokenization_utils_base.py:2519 >> Special tokens file saved in saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-200/special_tokens_map.json
[INFO|2025-04-23 00:50:13] logging.py:143 >> {'loss': 0.1003, 'learning_rate': 1.2444e-05, 'epoch': 2.53, 'throughput': 36.23}
[INFO|2025-04-23 00:52:01] logging.py:143 >> {'loss': 0.1155, 'learning_rate': 9.5063e-06, 'epoch': 2.59, 'throughput': 36.21}
[INFO|2025-04-23 00:53:51] logging.py:143 >> {'loss': 0.0962, 'learning_rate': 6.9464e-06, 'epoch': 2.65, 'throughput': 36.22}
[INFO|2025-04-23 00:55:43] logging.py:143 >> {'loss': 0.0808, 'learning_rate': 4.7752e-06, 'epoch': 2.71, 'throughput': 36.23}
[INFO|2025-04-23 00:57:33] logging.py:143 >> {'loss': 0.1206, 'learning_rate': 3.0018e-06, 'epoch': 2.77, 'throughput': 36.22}
[INFO|2025-04-23 00:59:23] logging.py:143 >> {'loss': 0.1259, 'learning_rate': 1.6335e-06, 'epoch': 2.83, 'throughput': 36.22}
[INFO|2025-04-23 01:01:11] logging.py:143 >> {'loss': 0.1580, 'learning_rate': 6.7616e-07, 'epoch': 2.89, 'throughput': 36.22}
[INFO|2025-04-23 01:02:59] logging.py:143 >> {'loss': 0.0620, 'learning_rate': 1.3368e-07, 'epoch': 2.95, 'throughput': 36.21}
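The logged learning rates decay smoothly from ~2e-4 toward zero over the 243 optimization steps, consistent with a cosine schedule with no warmup. A small sketch that reproduces the curve (a check under the assumption of a cosine scheduler peaking at 2e-4; the step indices below are aligned with the 5-step logging interval and may be off by one relative to the trainer's internal counter):

    import math

    PEAK_LR = 2e-4
    TOTAL_STEPS = 243

    def cosine_lr(step: int) -> float:
        # Cosine decay from PEAK_LR at step 0 to 0 at TOTAL_STEPS, no warmup.
        progress = step / TOTAL_STEPS
        return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

    print(cosine_lr(4))    # ~1.9987e-04, matching the first logged value
    print(cosine_lr(119))  # ~1.0323e-04, matching the value logged around epoch 1.48
    print(cosine_lr(239))  # ~1.3368e-07, matching the last periodic value at epoch 2.95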
"attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 30, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.3", "use_cache": true, "vocab_size": 102400 } [INFO|2025-04-23 01:04:08] tokenization_utils_base.py:2510 >> tokenizer config file saved in saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-243/tokenizer_config.json [INFO|2025-04-23 01:04:08] tokenization_utils_base.py:2519 >> Special tokens file saved in saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/checkpoint-243/special_tokens_map.json [INFO|2025-04-23 01:04:09] trainer.py:2681 >> Training completed. Do not forget to share your model on huggingface.co/models =) [INFO|2025-04-23 01:04:09] trainer.py:3984 >> Saving model checkpoint to saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06 [INFO|2025-04-23 01:04:09] configuration_utils.py:693 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--deepseek-ai--deepseek-llm-7b-base/snapshots/7683fea62db869066ddaff6a41d032262c490d4f/config.json [INFO|2025-04-23 01:04:09] configuration_utils.py:765 >> Model config LlamaConfig { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 30, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.51.3", "use_cache": true, "vocab_size": 102400 } [INFO|2025-04-23 01:04:09] tokenization_utils_base.py:2510 >> tokenizer config file saved in saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/tokenizer_config.json [INFO|2025-04-23 01:04:09] tokenization_utils_base.py:2519 >> Special tokens file saved in saves/DeepSeek-LLM-7B-Base/lora/train_2025-04-22-23-24-06/special_tokens_map.json [WARNING|2025-04-23 01:04:10] logging.py:148 >> No metric eval_loss to plot. [WARNING|2025-04-23 01:04:10] logging.py:148 >> No metric eval_accuracy to plot. [INFO|2025-04-23 01:04:10] modelcard.py:450 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}