# mistral-dpo
This model is a DPO fine-tuned version of TheBloke/OpenHermes-2-Mistral-7B-GPTQ on an unspecified dataset. It achieves the following results on the evaluation set:
- Loss: 0.0001
- Rewards/chosen: 2.5279
- Rewards/rejected: -6.8729
- Rewards/accuracies: 1.0
- Rewards/margins: 9.4009
- Logps/rejected: -86.5415
- Logps/chosen: -10.6380
- Logits/rejected: -2.2909
- Logits/chosen: -2.3250
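For context (the auto-generated card does not define these columns): the `Rewards/*` values are DPO's implicit rewards, so `Rewards/margins` is simply the chosen reward minus the rejected reward, e.g. 2.5279 - (-6.8729) ≈ 9.4009 above. A minimal statement of the standard definition, assuming trl's logging conventions:

```latex
% Implicit DPO reward (Rafailov et al., 2023): \beta is the DPO temperature,
% \pi_{\mathrm{ref}} the frozen reference policy.
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\text{margin} = r_\theta(x, y_{\mathrm{chosen}}) - r_\theta(x, y_{\mathrm{rejected}})
```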
## Model description
More information needed
## Intended uses & limitations
More information needed
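Since no usage details are given, the snippet below is only a plausible inference sketch. It assumes this repo hosts a PEFT (LoRA) adapter trained on top of the GPTQ base, which is the usual setup when fine-tuning GPTQ checkpoints; the repo ids come from the model tree below, the prompt is arbitrary, and loading the GPTQ base requires `auto-gptq`/`optimum` to be installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "TheBloke/OpenHermes-2-Mistral-7B-GPTQ"
adapter_id = "enkhtogtokh/mistral-dpo"  # this repo, assumed to contain a PEFT adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)  # attach the DPO-trained adapter

prompt = "What is direct preference optimization?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```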
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 2
- training_steps: 50
- mixed_precision_training: Native AMP
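These settings map fairly directly onto `transformers.TrainingArguments` plus trl's `DPOTrainer`. The sketch below is a plausible reconstruction, not the author's script: `model`, `tokenizer`, and the datasets are placeholders, and the DPO `beta` is trl's default since the card does not record it.

```python
from transformers import TrainingArguments
from trl import DPOTrainer

training_args = TrainingArguments(
    output_dir="mistral-dpo",
    learning_rate=2e-4,             # learning_rate: 0.0002
    per_device_train_batch_size=1,  # train_batch_size: 1
    per_device_eval_batch_size=8,   # eval_batch_size: 8
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=2,
    max_steps=50,                   # training_steps: 50
    fp16=True,                      # "Native AMP" mixed precision
    optim="adamw_torch",            # Adam with betas=(0.9, 0.999), eps=1e-8 (defaults)
)

# Placeholders: the card does not record the dataset, and `model` is assumed
# to be a PEFT-wrapped GPTQ base. The DPO dataset needs "prompt", "chosen",
# and "rejected" columns.
trainer = DPOTrainer(
    model=model,
    ref_model=None,   # with a PEFT model, trl reuses the frozen base as the reference
    args=training_args,
    beta=0.1,         # trl's default DPO temperature; not recorded in this card
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```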
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5916 | 0.09 | 10 | 0.3053 | 0.6309 | -0.4316 | 1.0 | 1.0625 | -22.1280 | -29.6084 | -2.4417 | -2.4765 |
| 0.127 | 0.19 | 20 | 0.0029 | 1.9393 | -4.1219 | 1.0 | 6.0612 | -59.0316 | -16.5245 | -2.3738 | -2.4158 |
| 0.0013 | 0.28 | 30 | 0.0003 | 2.3840 | -5.7860 | 1.0 | 8.1700 | -75.6720 | -12.0770 | -2.3067 | -2.3466 |
| 0.0002 | 0.37 | 40 | 0.0001 | 2.4704 | -6.5625 | 1.0 | 9.0328 | -83.4367 | -11.2135 | -2.2895 | -2.3248 |
| 0.0002 | 0.46 | 50 | 0.0001 | 2.5279 | -6.8729 | 1.0 | 9.4009 | -86.5415 | -10.6380 | -2.2909 | -2.3250 |
### Framework versions
- Transformers 4.35.2
- PyTorch 2.0.1+cu117
- Datasets 2.15.0
- Tokenizers 0.15.0
## Model tree for enkhtogtokh/mistral-dpo

- Base model: mistralai/Mistral-7B-v0.1
- Fine-tuned: teknium/OpenHermes-2-Mistral-7B
- Quantized: TheBloke/OpenHermes-2-Mistral-7B-GPTQ