See axolotl config
axolotl version: `0.10.0.dev0`

```yaml
base_model: Dans-DiscountModels/Mistral-Nemo-Base-2407-DanChat
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code:
# wandb configuration
wandb_project: 12b-mn-dans-personality-engine
wandb_watch:
wandb_run_id: V1.3.0-1-4 # V{Version}-{Run Number}-{Attempt Number}
wandb_log_model:
# push checkpoints to hub
hub_model_id: Dans-DiscountModels/12b-mn-dans-personality-engine-v1.3.0-TestArticle-1
# how to push checkpoints to hub
# https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
hub_strategy: "every_save"
# Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
# Required to be true when used in combination with `push_dataset_to_hub`
hf_use_auth_token: true
# where to save the finished model to
output_dir: ./12b-mn-dans-personality-engine-v1.3.0
# dataset settings (local or huggingface repo)
datasets:
  - path: Dans-DiscountModels/pretokenization-test-5
    ds_type: parquet
    type:
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true
cut_cross_entropy: true
load_in_8bit: false
load_in_4bit: false
strict: false
adapter:
lora_model_dir:
dataset_prepared_path: ./12b-mn-dans-personality-engine-data
val_set_size: 0.003
sequence_len: 32768
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
gradient_checkpointing: true
gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 2
optimizer: ademamix_8bit
optim_args: "beta1=0.9,beta2=0.999,beta3=0.999,alpha=5"
lr_scheduler: rex
learning_rate: 0.00001
cosine_min_lr_ratio:
weight_decay:
max_grad_norm: 0.001
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 24
eval_table_size:
eval_max_new_tokens:
saves_per_epoch: 2
save_total_limit: 1
debug: false
deepspeed: deepspeed_configs/zero3_bf16.json
fsdp:
fsdp_config:
special_tokens:
```
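A small sanity-check sketch for the config above; it assumes the YAML has been saved locally as `config.yml` (a hypothetical filename), uses only PyYAML, and simply derives the per-device token budget from the packing and batch settings:

```python
# Load the training config above (assumed saved as config.yml) and report the
# approximate token budget per optimizer step on a single device. With
# sample_packing enabled, packed sequences are close to sequence_len tokens.
import yaml

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

tokens_per_step = (
    cfg["sequence_len"]                   # 32768 tokens per packed sequence
    * cfg["micro_batch_size"]             # 2 sequences per forward pass
    * cfg["gradient_accumulation_steps"]  # 2 forward passes per optimizer step
)
print(f"~{tokens_per_step:,} tokens per optimizer step per device")  # ~131,072
```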
# 12b-mn-dans-personality-engine-v1.3.0-TestArticle-1
This model is a fine-tuned version of Dans-DiscountModels/Mistral-Nemo-Base-2407-DanChat on the Dans-DiscountModels/pretokenization-test-5 dataset. It achieves the following results on the evaluation set:
- Loss: 1.4392
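A minimal inference sketch, assuming the checkpoint published under the `hub_model_id` above is accessible to you; it uses the `AutoModelForCausalLM` / `AutoTokenizer` classes named in the config, and the prompt is purely illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Dans-DiscountModels/12b-mn-dans-personality-engine-v1.3.0-TestArticle-1"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # training ran in bf16, so load in bf16 as well
    device_map="auto",
)

inputs = tokenizer("Hello,", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```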
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- total_eval_batch_size: 16
- optimizer: ademamix_8bit with args: beta1=0.9, beta2=0.999, beta3=0.999, alpha=5
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 321
- num_epochs: 2.0
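The total batch sizes above follow directly from the per-device batch size, the gradient accumulation steps, and the number of devices; a quick check of that arithmetic:

```python
micro_batch_size = 2             # train/eval batch size per device
gradient_accumulation_steps = 2  # only applies to training
num_devices = 8

print(micro_batch_size * gradient_accumulation_steps * num_devices)  # 32 (train)
print(micro_batch_size * num_devices)                                # 16 (eval)
```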
### Training results
| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
1.8086 | 0.0006 | 1 | 1.7459 |
1.593 | 0.0417 | 67 | 1.5911 |
1.5578 | 0.0833 | 134 | 1.5565 |
1.5782 | 0.1250 | 201 | 1.5436 |
1.5702 | 0.1666 | 268 | 1.5377 |
1.5926 | 0.2083 | 335 | 1.5328 |
1.6364 | 0.2499 | 402 | 1.5291 |
1.5082 | 0.2916 | 469 | 1.5234 |
1.6002 | 0.3332 | 536 | 1.5197 |
1.5252 | 0.3749 | 603 | 1.5162 |
1.5915 | 0.4165 | 670 | 1.5121 |
1.5108 | 0.4582 | 737 | 1.5103 |
1.5663 | 0.4998 | 804 | 1.5063 |
1.5085 | 0.5415 | 871 | 1.5037 |
1.4273 | 0.5832 | 938 | 1.5024 |
1.5528 | 0.6248 | 1005 | 1.4994 |
1.6072 | 0.6665 | 1072 | 1.4975 |
1.6074 | 0.7081 | 1139 | 1.4920 |
1.5495 | 0.7498 | 1206 | 1.4904 |
1.6117 | 0.7914 | 1273 | 1.4883 |
1.4621 | 0.8331 | 1340 | 1.4850 |
1.6381 | 0.8747 | 1407 | 1.4838 |
1.4221 | 0.9164 | 1474 | 1.4813 |
1.5812 | 0.9580 | 1541 | 1.4789 |
1.4581 | 0.9997 | 1608 | 1.4750 |
1.4608 | 1.0417 | 1675 | 1.4800 |
1.5261 | 1.0833 | 1742 | 1.4798 |
1.3856 | 1.1250 | 1809 | 1.4796 |
1.4469 | 1.1666 | 1876 | 1.4766 |
1.4783 | 1.2083 | 1943 | 1.4741 |
1.5025 | 1.2499 | 2010 | 1.4733 |
1.4531 | 1.2916 | 2077 | 1.4726 |
1.4719 | 1.3332 | 2144 | 1.4712 |
1.4123 | 1.3749 | 2211 | 1.4700 |
1.4653 | 1.4165 | 2278 | 1.4673 |
1.4571 | 1.4582 | 2345 | 1.4660 |
1.4261 | 1.4998 | 2412 | 1.4660 |
1.3212 | 1.5415 | 2479 | 1.4620 |
1.3828 | 1.5832 | 2546 | 1.4617 |
1.3617 | 1.6248 | 2613 | 1.4597 |
1.4364 | 1.6665 | 2680 | 1.4567 |
1.4686 | 1.7081 | 2747 | 1.4549 |
1.3317 | 1.7498 | 2814 | 1.4530 |
1.3749 | 1.7914 | 2881 | 1.4506 |
1.4116 | 1.8331 | 2948 | 1.4468 |
1.3988 | 1.8747 | 3015 | 1.4456 |
1.2534 | 1.9164 | 3082 | 1.4448 |
1.3564 | 1.9580 | 3149 | 1.4412 |
1.3668 | 1.9997 | 3216 | 1.4392 |
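Assuming the validation loss is the usual per-token cross-entropy in nats, the final value corresponds to a token-level perplexity of roughly exp(1.4392):

```python
import math

final_eval_loss = 1.4392  # last row of the table above
print(f"perplexity ≈ {math.exp(final_eval_loss):.2f}")  # ≈ 4.22
```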
### Framework versions
- Transformers 4.51.3
- Pytorch 2.4.1+cu121
- Datasets 3.5.1
- Tokenizers 0.21.1
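To reproduce results against this exact environment, it can help to confirm the installed packages match the versions listed above; a small check using `importlib.metadata` (package names assumed to be the standard PyPI ones):

```python
from importlib.metadata import version

expected = {
    "transformers": "4.51.3",
    "torch": "2.4.1",      # the card lists 2.4.1+cu121; the local build suffix may differ
    "datasets": "3.5.1",
    "tokenizers": "0.21.1",
}

for package, want in expected.items():
    have = version(package)
    status = "OK" if have.startswith(want) else "MISMATCH"
    print(f"{package}: installed {have}, expected {want} -> {status}")
```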