See axolotl config

axolotl version: `0.12.2`

```yaml
base_model: google/gemma-3n-E2B-it
hub_model_id: sudoping01/bambara-llm-exp2

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true

# Memory optimization for multi-GPU
load_in_8bit: false
load_in_4bit: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# Multi-GPU configuration
ddp: true

chat_template: gemma3n
eot_tokens:
  - <end_of_turn>

# Bambara dataset
datasets:
  - path: sudoping01/bambara-instructions
    type: chat_template
    split: train
    name: cleaned
    field_messages: messages
    message_property_mappings:
      role: role
      content: content

val_set_size: 0.05
output_dir: ./outputs/bambara-gemma3n

adapter: qlora
lora_r: 16       # reduced from 32
lora_alpha: 32   # kept at 2x lora_r
lora_dropout: 0.05
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'

# Sequence and packing
sequence_len: 4096           # reduced from 32000
sample_packing: false        # disabled for memory
eval_sample_packing: false   # disabled for memory
pad_to_sequence_len: false   # disabled for memory

# Multi-GPU batch sizes
micro_batch_size: 8
gradient_accumulation_steps: 16   # increased to preserve the effective batch size
num_epochs: 5                     # reduced from 20

# Training parameters
optimizer: adamw_8bit   # more memory-efficient
lr_scheduler: cosine
learning_rate: 0.001
warmup_ratio: 0.1
weight_decay: 0.01

# Precision and performance
bf16: auto
tf32: false   # disabled to save memory

# Logging and saving
logging_steps: 10
saves_per_epoch: 1
evals_per_epoch: 0   # evaluation disabled to save memory

# Dataloader settings
dataloader_num_workers: 4
dataloader_pin_memory: false   # disabled to save memory
group_by_length: false         # disabled to save memory

special_tokens:
```
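Note that `lora_target_modules` here is a regular expression rather than a list of module names, so the adapter is restricted to the language-model layers. A minimal sketch of which names the pattern selects (the candidate module names below are hypothetical examples for illustration, not read from the model):

```python
import re

# LoRA target-module pattern from the config above.
pattern = re.compile(
    r"model.language_model.layers.[\d]+.(mlp|self_attn).(up|down|gate|q|k|v|o)_proj"
)

# Hypothetical module names in the style of a multimodal Gemma checkpoint.
candidates = [
    "model.language_model.layers.0.self_attn.q_proj",  # attention projection: selected
    "model.language_model.layers.11.mlp.up_proj",      # MLP projection: selected
    "model.vision_tower.blocks.0.attn.q_proj",         # vision tower: excluded
]

matched = [name for name in candidates if pattern.fullmatch(name)]
print(matched)  # only the two language-model projections match
```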
# bambara-llm-exp2
This model is a fine-tuned version of [google/gemma-3n-E2B-it](https://huggingface.co/google/gemma-3n-E2B-it) on the [sudoping01/bambara-instructions](https://huggingface.co/datasets/sudoping01/bambara-instructions) dataset.
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 16
- total_train_batch_size: 1024
- total_eval_batch_size: 64
- optimizer: adamw_8bit with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 422
- training_steps: 4224
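The derived values above follow directly from the config. With 8 devices, the effective batch size and warmup length can be checked with a few lines of arithmetic:

```python
# Values from the training config and hyperparameter list above.
micro_batch_size = 8
gradient_accumulation_steps = 16
num_devices = 8

# Effective (total) train batch size across all GPUs.
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 1024, matching total_train_batch_size

# Warmup length from warmup_ratio and the total step count.
warmup_ratio = 0.1
training_steps = 4224
warmup_steps = int(training_steps * warmup_ratio)
print(warmup_steps)  # 422, matching lr_scheduler_warmup_steps
```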
### Training results

No evaluation results were logged during training (`evals_per_epoch: 0`).
### Framework versions
- PEFT 0.17.0
- Transformers 4.55.2
- Pytorch 2.6.0+cu124
- Datasets 4.0.0
- Tokenizers 0.21.4