See axolotl config

axolotl version: 0.8.0.dev0

# === Start-up Commands ===
# curl -LsSf https://astral.sh/uv/install.sh | sh
# export PATH="$HOME/.local/bin:$PATH"
# git clone https://github.com/axolotl-ai-cloud/axolotl
# cd axolotl
# git checkout d8b4027200de0fe60f4ae0a71272c1a8cb2888f7
# uv venv
# source .venv/bin/activate
# uv pip install packaging ninja setuptools ftfy huggingface_hub[cli,hf_transfer]
# uv pip install "cut-cross-entropy[transformers] @ git+https://github.com/strangedove/ml-cross-entropy.git"
# uv pip install apollo-torch
# uv pip install --no-build-isolation -e .[flash-attn,deepspeed]
# uv pip install git+https://github.com/huggingface/transformers.git
# export HF_HUB_ENABLE_HF_TRANSFER=1
# huggingface-cli login --token $hf_key && wandb login $wandb_key
# axolotl preprocess qwen21-pretrain.yml
# axolotl train qwen21-pretrain.yml

# curl -LsSf https://astral.sh/uv/install.sh | sh && export PATH="$HOME/.local/bin:$PATH" && git clone https://github.com/axolotl-ai-cloud/axolotl && uv venv && source .venv/bin/activate && cd axolotl && uv pip install torch==2.5.1 packaging ninja setuptools ftfy huggingface_hub[cli,hf_transfer] && uv pip install "cut-cross-entropy[transformers] @ git+https://github.com/strangedove/ml-cross-entropy.git" && uv pip install apollo-torch && uv pip install --no-build-isolation -e .[flash-attn,deepspeed] && uv pip install git+https://github.com/huggingface/transformers.git && export HF_HUB_ENABLE_HF_TRANSFER=1 && cd .. && huggingface-cli login --token $hf_key && wandb login $wandb_key

# === Model Configuration ===
base_model: unsloth/gemma-3-12b-pt
load_in_8bit: false
load_in_4bit: true

# === HF Configuration === 
hub_model_id: ToastyPigeon/g3-12b-pt-story-qlora
hub_strategy: "every_save"

# === Training Setup ===
num_epochs: 2
micro_batch_size: 2
gradient_accumulation_steps: 2
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

# === Evaluation ===
val_set_size: 100
evals_per_epoch: 5
#eval_table_size:
eval_max_new_tokens: 256
eval_sample_packing: true
#eval_strategy: "no"

# === LoRA Configuration ===
adapter: qlora
lora_model_dir:
lora_r: 64
lora_alpha: 64
lora_dropout: 0.5
lora_target_linear: 
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

#lora_mlp_kernel: true
#lora_qkv_kernel: true
#lora_o_kernel: true

# === Hyperparameter Configuration ===
#optimizer: apollo_adamw_layerwise
optimizer: paged_adamw_8bit
# Apollo-mini configuration:
#optim_args: "proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200"
# Regular Apollo configuration:
# optim_args: 
#optim_target_modules: all_linear
learning_rate: 1e-5
lr_scheduler: rex
weight_decay: 0.01
#warmup_ratio: 0.05


# === Data Configuration ===
#chat_template: jinja
#chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
#special_tokens:
#  eos_token: "<end_of_turn>"
shuffle_merged_datasets: true
datasets:
  - path: ToastyPigeon/new-story-dataset
    type: customcompletion-regex
    field: text
dataset_prepared_path: last_run_prepared


# === Plugins ===
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

# === Hardware Optimization ===
gradient_checkpointing: true
#gradient_checkpointing_kwargs:
#  use_reentrant: true
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
#liger_fused_linear_cross_entropy: true
#unsloth_cross_entropy_loss: true
cut_cross_entropy: true
# Only if using multiple GPUs:
deepspeed: axolotl/deepspeed_configs/zero2.json
max_grad_norm: 2.0

# === Wandb Tracking ===
wandb_project: Gemma
# wandb_entity: [WANDB_ENTITY]
# wandb_name: [WANDB_RUN_NAME]

# === Checkpointing ===
saves_per_epoch: 10
save_total_limit: 1

# === Advanced Settings ===
output_dir: ./ckpts
bf16: auto
flash_attention: true
train_on_inputs: false
group_by_length: false
save_safetensors: true
logging_steps: 1
gc_steps: 10
seed: 69

g3-12b-pt-story-qlora

This model is a fine-tuned version of unsloth/gemma-3-12b-pt on the ToastyPigeon/new-story-dataset dataset. It achieves the following results on the evaluation set:

Loss: 2.6458

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 2
eval_batch_size: 2
seed: 69
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 16
total_eval_batch_size: 8
optimizer: Use OptimizerNames.PAGED_ADAMW_8BIT with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 10
num_epochs: 2.0

Training results

Training Loss	Epoch	Step	Validation Loss
3.7026	0.0058	1	3.8366
3.358	0.2035	35	3.2872
3.0792	0.4070	70	3.0753
2.881	0.6105	105	2.9436
2.9439	0.8140	140	2.8437
2.6859	1.0174	175	2.7684
2.6724	1.2209	210	2.7104
2.6565	1.4244	245	2.6730
2.6235	1.6279	280	2.6528
2.7326	1.8314	315	2.6458

Framework versions

PEFT 0.15.0
Transformers 4.51.0.dev0
Pytorch 2.5.1+cu124
Datasets 3.4.1
Tokenizers 0.21.1

ToastyPigeon
/

g3-12b-pt-story-qlora

g3-12b-pt-story-qlora

Model description

Intended uses & limitations

Training and evaluation data

Training procedure

Training hyperparameters

Training results

Framework versions

Model tree for ToastyPigeon/g3-12b-pt-story-qlora

Dataset used to train ToastyPigeon/g3-12b-pt-story-qlora

Evaluation results