Creation Process: Upscale > PT > SFT > KTO > DPO
Pretrained on approximately 300 MB of light novels, SFW / NSFW stories, and the FineWeb-2 corpus.
SFT on approximately 8 million tokens of SFW / NSFW RP, stories, and creative instruct data.
KTO on anti-repetition data built from the SFT datasets. Rejected examples were generated by MS3.2 with repetition_penalty=0.9 and OOC commands encouraging it to misgender characters, impersonate the user, etc.
DPO on an unreleased, high-quality RP / NSFW dataset, with rejected samples created by the same method as for KTO.
The resulting model was non-repetitive but had lost some of the spark of the original upscale, so the original upscale was merged back in, taking care not to reintroduce repetition.
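For illustration, a minimal sketch of how such rejected samples could be generated with Hugging Face transformers. The model path, prompt, and OOC instruction are hypothetical stand-ins; the key detail is that a repetition_penalty below 1.0 rewards repeated tokens instead of penalizing them:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical generator checkpoint; the card only says "MS3.2" was used.
MODEL = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    # OOC command steering the model toward the failure modes being trained against.
    {"role": "system", "content": "[OOC: Misgender the characters and speak for {{user}}.]"},
    {"role": "user", "content": "A conversation turn taken from the SFT dataset goes here."},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# repetition_penalty < 1.0 boosts already-seen tokens, yielding the
# repetitive completions used as rejected examples.
out = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=0.9,
)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))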
Mergekit configs
Merge configurations used during the model creation process.
Initial Upscale (Passthrough)
base_model: zerofata/MS3.2-PaintedFantasy-v2-24B
merge_method: passthrough
dtype: bfloat16
slices:
  - sources:
      - model: zerofata/MS3.2-PaintedFantasy-v2-24B
        layer_range: [0, 29]
  - sources:
      - model: zerofata/MS3.2-PaintedFantasy-v2-24B
        layer_range: [10, 39]
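Assuming a standard mergekit install, a config like this is applied with the mergekit-yaml CLI, e.g. mergekit-yaml upscale.yml ./pf_v2_upscale (the config filename is illustrative); the output directory corresponds to the ../mergekit/pf_v2_upscale path that the pretrain config below points at.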
Final Merge (Slerp)
models:
  - model: zerofata/MS3.2-PaintedFantasy-Visage-33B
  - model: ../axolotl/Visage-V2-PT-1-SFT-2-KTO-1-DPO-1/merged
merge_method: slerp
base_model: ../axolotl/Visage-V2-PT-1-SFT-2-KTO-1-DPO-1/merged
parameters:
  t: [0.4, 0.2, 0, 0.2, 0.4]
dtype: bfloat16
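In mergekit's slerp, t is the interpolation weight toward the non-base model, and a list of values is spread as a gradient across layer depth: here the trained model's middle layers are kept as-is (t = 0) while the outermost layers take up to a 0.4 contribution from PaintedFantasy-Visage-33B, blending the original character back in while limiting how much repetition comes with it.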
Axolotl configs
Not optimized for cost or performance efficiency; YMMV.
Pretrain 4*H100
# ====================
# MODEL CONFIGURATION
# ====================
base_model: ../mergekit/pf_v2_upscale
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
chat_template: mistral_v7_tekken
# ====================
# DATASET CONFIGURATION
# ====================
datasets:
  - path: ./data/pretrain_dataset_v5_stripped.jsonl
    type: completion
dataset_prepared_path:
train_on_inputs: false # Only train on assistant responses
# ====================
# QLORA CONFIGURATION
# ====================
adapter: qlora
load_in_4bit: true
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
# lora_modules_to_save: # Uncomment only if you added NEW tokens
# ====================
# TRAINING PARAMETERS
# ====================
num_epochs: 1
micro_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 4e-5
optimizer: paged_adamw_8bit
lr_scheduler: rex
warmup_ratio: 0.05
weight_decay: 0.01
max_grad_norm: 1.0
# ====================
# SEQUENCE & PACKING
# ====================
sequence_len: 12288
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
# ====================
# HARDWARE OPTIMIZATIONS
# ====================
bf16: auto
flash_attention: true
gradient_checkpointing: offload
deepspeed: deepspeed_configs/zero1.json
plugins:
- axolotl.integrations.liger.LigerPlugin
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_cross_entropy: false # Cut Cross Entropy overrides this
liger_fused_linear_cross_entropy: false # Cut Cross Entropy overrides this
# ====================
# EVALUATION & CHECKPOINTING
# ====================
save_strategy: steps
save_steps: 40
save_total_limit: 5 # Keep best + last few checkpoints
load_best_model_at_end: true
greater_is_better: false
# ====================
# LOGGING & OUTPUT
# ====================
output_dir: ./Visage-V2-PT-1
logging_steps: 2
save_safetensors: true
# ====================
# WANDB TRACKING
# ====================
wandb_project: Visage-V2-PT
# wandb_entity: your_entity
wandb_name: Visage-V2-PT-1
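Depending on the axolotl version, a config like this is launched with axolotl train pretrain.yml or accelerate launch -m axolotl.cli.train pretrain.yml (filename illustrative). Below is a minimal sketch of one row of the completion-type dataset; the "text" field is axolotl's default for type: completion, and the content is a made-up placeholder:

import json

# Illustrative completion-type row: axolotl's "completion" format reads
# raw text from the "text" field by default. Content is a placeholder.
row = {"text": "Chapter 1\n\nThe rain had not let up for three days, and the city..."}

with open("data/pretrain_dataset_v5_stripped.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")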
SFT 4*H100
# ====================
# MODEL CONFIGURATION
# ====================
base_model: ./Visage-V2-PT-1/merged
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
chat_template: mistral_v7_tekken
# ====================
# DATASET CONFIGURATION
# ====================
datasets:
  - path: ./data/automated_dataset.jsonl
    type: chat_template
    split: train
    chat_template_strategy: tokenizer
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
  - path: ./data/handcrafted_dataset.jsonl
    type: chat_template
    split: train
    chat_template_strategy: tokenizer
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
  - path: ./data/instruct_dataset.jsonl
    type: chat_template
    split: train
    chat_template_strategy: tokenizer
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
  - path: ./data/cw_dataset.jsonl
    type: chat_template
    split: train
    chat_template_strategy: tokenizer
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
  - path: ./data/stories_dataset.jsonl
    type: chat_template
    split: train
    chat_template_strategy: tokenizer
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
  - path: ./data/cw_claude_dataset.jsonl
    type: chat_template
    split: train
    chat_template_strategy: tokenizer
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
  - path: ./data/summaries_dataset.jsonl
    type: chat_template
    split: train
    chat_template_strategy: tokenizer
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
dataset_prepared_path:
train_on_inputs: false # Only train on assistant responses
# ====================
# QLORA CONFIGURATION
# ====================
adapter: qlora
load_in_4bit: true
lora_r: 128
lora_alpha: 128
lora_dropout: 0.1
lora_target_linear: true
# lora_modules_to_save: # Uncomment only if you added NEW tokens
# ====================
# TRAINING PARAMETERS
# ====================
num_epochs: 2
micro_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1e-5
optimizer: paged_adamw_8bit
lr_scheduler: rex
warmup_ratio: 0.05
weight_decay: 0.01
max_grad_norm: 1.0
# ====================
# SEQUENCE & PACKING
# ====================
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
# ====================
# HARDWARE OPTIMIZATIONS
# ====================
bf16: auto
flash_attention: true
gradient_checkpointing: offload
deepspeed: deepspeed_configs/zero1.json
plugins:
- axolotl.integrations.liger.LigerPlugin
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_cross_entropy: false # Cut Cross Entropy overrides this
liger_fused_linear_cross_entropy: false # Cut Cross Entropy overrides this
# ====================
# EVALUATION & CHECKPOINTING
# ====================
save_strategy: steps
save_steps: 20
save_total_limit: 5 # Keep best + last few checkpoints
load_best_model_at_end: true
metric_for_best_model: eval_loss
greater_is_better: false
# ====================
# LOGGING & OUTPUT
# ====================
output_dir: ./Visage-V2-PT-1-SFT-2
logging_steps: 2
save_safetensors: true
# ====================
# WANDB TRACKING
# ====================
wandb_project: Visage-V2-SFT
# wandb_entity: your_entity
wandb_name: Visage-V2-PT-1-SFT-2
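A minimal sketch of one row for the chat_template SFT datasets above, assuming the OpenAI-style messages schema that the field_messages / message_property_mappings keys describe; the content is a made-up placeholder:

import json

# Illustrative chat_template row: a "messages" list whose role/content
# fields match the mappings and roles declared in the config above.
row = {
    "messages": [
        {"role": "system", "content": "You are {{char}}. Stay in character."},
        {"role": "user", "content": "The inn falls quiet as you walk in."},
        {"role": "assistant", "content": "Rough night? she asks, sliding a mug across the bar."},
    ]
}

with open("data/handcrafted_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")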
KTO 4*H100
# ====================
# MODEL CONFIGURATION
# ====================
base_model: ./Visage-V2-PT-1-SFT-2/merged
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
chat_template: mistral_v7_tekken
# ====================
# RL/DPO CONFIGURATION
# ====================
rl: kto
rl_beta: 0.1
kto_desirable_weight: 1.25
kto_undesirable_weight: 1.0
# ====================
# DATASET CONFIGURATION
# ====================
datasets:
  - path: ./handcrafted_dataset_kto.jsonl
    type: llama3.argilla
  - path: ./approved_rp_dataset_kto.jsonl
    type: llama3.argilla
  - path: ./instruct_dataset_kto.jsonl
    type: llama3.argilla
dataset_prepared_path:
train_on_inputs: false # Only train on assistant responses
remove_unused_columns: false
# ====================
# QLORA CONFIGURATION
# ====================
adapter: qlora
load_in_4bit: true
lora_r: 32
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
# lora_modules_to_save: # Uncomment only if you added NEW tokens
# ====================
# TRAINING PARAMETERS
# ====================
num_epochs: 1
micro_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 5e-6
optimizer: adamw_8bit
lr_scheduler: cosine
warmup_steps: 15
weight_decay: 0.001
max_grad_norm: 0.01
# ====================
# SEQUENCE CONFIGURATION
# ====================
sequence_len: 8192
pad_to_sequence_len: true
# ====================
# HARDWARE OPTIMIZATIONS
# ====================
bf16: auto
tf32: false
flash_attention: true
gradient_checkpointing: offload
deepspeed: deepspeed_configs/zero1.json
plugins:
- axolotl.integrations.liger.LigerPlugin
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_cross_entropy: false # Cut Cross Entropy overrides this
liger_fused_linear_cross_entropy: false # Cut Cross Entropy overrides this
# ====================
# CHECKPOINTING
# ====================
save_steps: 100
save_total_limit: 10
load_best_model_at_end: true
metric_for_best_model: eval_loss
greater_is_better: false
# ====================
# LOGGING & OUTPUT
# ====================
output_dir: ./Visage-V2-PT-1-SFT-2-KTO-1
logging_steps: 2
save_safetensors: true
# ====================
# WANDB TRACKING
# ====================
wandb_project: Visage-V2-KTO
# wandb_entity: your_entity
wandb_name: Visage-V2-PT-1-SFT-2-KTO-1
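Unlike DPO, KTO rows pair a prompt with a single completion plus a boolean desirability label rather than a chosen/rejected pair. A minimal sketch of one undesirable row, assuming the instruction / completion / label field names of axolotl's llama3.argilla KTO strategy (verify against the installed axolotl version):

import json

# Illustrative KTO row; label=False marks this repetitive / OOC-violating
# completion as undesirable. Field names assume the llama3.argilla KTO
# schema and should be checked against the axolotl version in use.
row = {
    "instruction": "The inn falls quiet as you walk in.",
    "completion": "She smiles. She smiles again. She smiles and smiles and smiles.",
    "label": False,
}

with open("instruct_dataset_kto.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")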
DPO 4*H100
# ====================
# MODEL CONFIGURATION
# ====================
base_model: ./Visage-V2-PT-1-SFT-2-KTO-1/merged
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
chat_template: mistral_v7_tekken
# ====================
# RL/DPO CONFIGURATION
# ====================
rl: dpo
rl_beta: 0.1
# ====================
# DATASET CONFIGURATION
# ====================
datasets:
  - path: ./handcrafted_dataset_mistral_rep.jsonl
    type: chat_template.default
    field_messages: messages
    field_chosen: chosen
    field_rejected: rejected
    message_property_mappings:
      role: role
      content: content
    roles:
      system: ["system"]
      user: ["user"]
      assistant: ["assistant"]
dataset_prepared_path:
train_on_inputs: false # Only train on assistant responses
# ====================
# QLORA CONFIGURATION
# ====================
adapter: qlora
load_in_4bit: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
lora_target_linear: true
# lora_modules_to_save: # Uncomment only if you added NEW tokens
# ====================
# TRAINING PARAMETERS
# ====================
num_epochs: 1
micro_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 2e-6
optimizer: adamw_8bit
lr_scheduler: cosine
warmup_steps: 5
weight_decay: 0.01
max_grad_norm: 1.0
# ====================
# SEQUENCE CONFIGURATION
# ====================
sequence_len: 8192
pad_to_sequence_len: true
# ====================
# HARDWARE OPTIMIZATIONS
# ====================
bf16: auto
tf32: false
flash_attention: true
gradient_checkpointing: offload
deepspeed: deepspeed_configs/zero1.json
plugins:
- axolotl.integrations.liger.LigerPlugin
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_cross_entropy: false # Cut Cross Entropy overrides this
liger_fused_linear_cross_entropy: false # Cut Cross Entropy overrides this
# ====================
# CHECKPOINTING
# ====================
save_steps: 10
save_total_limit: 10
load_best_model_at_end: true
metric_for_best_model: eval_loss
greater_is_better: false
# ====================
# LOGGING & OUTPUT
# ====================
output_dir: ./Visage-V2-PT-1-SFT-2-KTO-1-DPO-1
logging_steps: 2
save_safetensors: true
# ====================
# WANDB TRACKING
# ====================
wandb_project: Visage-V2-DPO
# wandb_entity: your_entity
wandb_name: Visage-V2-PT-1-SFT-2-KTO-1-DPO-1
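A minimal sketch of one row for the chat_template.default DPO format above: shared context under messages, with single chosen / rejected assistant messages matching the field_chosen / field_rejected keys. The content is a placeholder (the actual dataset is unreleased):

import json

# Illustrative DPO row: shared conversation in "messages", preferred and
# dispreferred assistant replies in "chosen" / "rejected".
row = {
    "messages": [
        {"role": "system", "content": "You are {{char}}. Stay in character."},
        {"role": "user", "content": "The inn falls quiet as you walk in."},
    ],
    "chosen": {"role": "assistant", "content": "Rough night? she asks, sliding a mug across the bar."},
    "rejected": {"role": "assistant", "content": "She smiles. She smiles again. She smiles and smiles."},
}

with open("handcrafted_dataset_mistral_rep.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")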