zephyr-7b-dpo-qlora

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4945
  • Rewards/chosen: -2.5530
  • Rewards/rejected: -3.6159
  • Rewards/accuracies: 0.7778
  • Rewards/margins: 1.0629
  • Logps/rejected: -606.2373
  • Logps/chosen: -520.2218
  • Logits/rejected: -0.9908
  • Logits/chosen: -1.1030
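
The reward figures above are related by simple arithmetic: the margin is the chosen reward minus the rejected reward. The sketch below checks that, and also shows how a single margin would map to a loss value under the standard sigmoid DPO objective. The card does not state which loss variant or beta was used, so the `dpo_sigmoid_loss` helper is an illustrative assumption rather than the training code; it also assumes the reported rewards already include the beta scaling.

```python
import math

# Evaluation metrics copied from the list above.
rewards_chosen = -2.5530
rewards_rejected = -3.6159
reported_margin = 1.0629
reported_loss = 0.4945

# Rewards/margins is the chosen reward minus the rejected reward.
margin = rewards_chosen - rewards_rejected
print(round(margin, 4))


def dpo_sigmoid_loss(reward_margin: float) -> float:
    """Per-example loss -log(sigmoid(m)), written in a numerically stable form.

    Assumption: sigmoid DPO loss with no label smoothing, where m is the
    (beta-scaled) reward margin for one preference pair.
    """
    return max(-reward_margin, 0.0) + math.log1p(math.exp(-abs(reward_margin)))


# A margin of zero (policy identical to the reference) gives log(2).
print(dpo_sigmoid_loss(0.0))

# The loss at the mean margin is only a lower bound on the mean loss,
# because the per-example loss is convex in the margin.
print(dpo_sigmoid_loss(margin) <= reported_loss)
```

The last line is why the reported loss cannot be recovered from the reported margin alone: both are averages over the evaluation set, and the average of a convex function is at least the function of the average.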

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
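
The total batch sizes listed above follow from the per-device values, and the learning rate follows a warmup-then-cosine shape. The sketch below reproduces both in plain Python. `ASSUMED_TOTAL_STEPS` is not stated in the card; it is an estimate taken from the Epoch and Step columns of the training results (100 / 0.1047 is roughly 955). The `lr_at` helper is a hypothetical name, and it assumes linear warmup followed by a half-cosine decay to zero, which may differ in detail from the scheduler actually used.

```python
import math

# Hyperparameters copied from the list above.
learning_rate = 5e-06
train_batch_size = 4          # per device
eval_batch_size = 8           # per device
num_devices = 4
gradient_accumulation_steps = 4
warmup_ratio = 0.1

# Effective batch sizes.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)

# Assumption: roughly 955 optimizer steps in the single epoch.
ASSUMED_TOTAL_STEPS = 955


def lr_at(step: int, total_steps: int = ASSUMED_TOTAL_STEPS) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 48, 96, 500, 955):
    print(step, lr_at(step))
```

Under these assumptions the rate climbs linearly to 5e-06 over the first 96 steps and then decays towards zero by the end of the epoch.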

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|--------|------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 0.6212        | 0.1047 | 100  | 0.6321          | -0.3313        | -0.5450          | 0.6944             | 0.2137          | -299.1472      | -298.0506    | -2.0086         | -2.0933       |
| 0.5618        | 0.2094 | 200  | 0.5601          | -0.8198        | -1.3660          | 0.7222             | 0.5461          | -381.2446      | -346.9064    | -1.6694         | -1.7551       |
| 0.54          | 0.3141 | 300  | 0.5265          | -1.5221        | -2.3343          | 0.7460             | 0.8122          | -478.0748      | -417.1275    | -1.0704         | -1.1715       |
| 0.5261        | 0.4187 | 400  | 0.5082          | -1.6553        | -2.5263          | 0.7540             | 0.8710          | -497.2759      | -430.4526    | -1.1014         | -1.2013       |
| 0.5107        | 0.5234 | 500  | 0.5059          | -2.4506        | -3.4250          | 0.75               | 0.9744          | -587.1476      | -509.9848    | -0.9852         | -1.0956       |
| 0.4851        | 0.6281 | 600  | 0.5023          | -2.2726        | -3.2316          | 0.7679             | 0.9590          | -567.8049      | -492.1783    | -0.9970         | -1.1078       |
| 0.4681        | 0.7328 | 700  | 0.4993          | -2.3170        | -3.3688          | 0.7679             | 1.0517          | -581.5197      | -496.6232    | -1.0068         | -1.1190       |
| 0.4852        | 0.8375 | 800  | 0.4950          | -2.3970        | -3.4117          | 0.7738             | 1.0147          | -585.8156      | -504.6183    | -1.0237         | -1.1353       |
| 0.4907        | 0.9422 | 900  | 0.4945          | -2.5678        | -3.6349          | 0.7778             | 1.0671          | -608.1346      | -521.7063    | -0.9901         | -1.1024       |
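
The Epoch and Step columns above can be used to estimate how many optimizer steps make up one epoch, and from that the approximate number of training examples. The sketch below does this arithmetic; the batch size of 64 is the total_train_batch_size from the hyperparameters, and the resulting example count is only an estimate implied by rounded epoch values, not a figure stated in the card.

```python
# (epoch, step) pairs copied from the training results table.
checkpoints = [
    (0.1047, 100), (0.2094, 200), (0.3141, 300),
    (0.4187, 400), (0.5234, 500), (0.6281, 600),
    (0.7328, 700), (0.8375, 800), (0.9422, 900),
]
total_train_batch_size = 64

# Each row gives its own estimate of steps per epoch: step / epoch fraction.
estimates = [step / epoch for epoch, step in checkpoints]
steps_per_epoch = round(sum(estimates) / len(estimates))
print(steps_per_epoch)

# Approximate number of training examples seen in one epoch.
approx_examples = steps_per_epoch * total_train_batch_size
print(approx_examples)
```

Every row agrees on about 955 steps per epoch, so the last logged evaluation at step 900 falls shortly before the end of the single training epoch.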

Framework versions

  • PEFT 0.12.0
  • Transformers 4.44.2
  • Pytorch 2.4.0
  • Datasets 2.21.0
  • Tokenizers 0.19.1