llama3.1_8b_dpo_bwgenerator_test

This model is a fine-tuned version of meta-llama/Meta-Llama-3.1-8B-Instruct; the training dataset is not specified. At the final evaluation (step 13000) it achieves the following results on the evaluation set:

Loss: 0.0381
Rewards/chosen: -9.3770
Rewards/rejected: -40.9760
Rewards/accuracies: 0.9961
Rewards/margins: 31.5990
Logps/rejected: -519.9075
Logps/chosen: -178.3189
Logits/rejected: -1.4901
Logits/chosen: -1.9907
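
The metric names above match those commonly logged by DPO-style trainers. Assuming that convention, Rewards/margins is the mean chosen reward minus the mean rejected reward. The check below is a small sketch that uses only the numbers reported above; the variable names are illustrative.

```python
# Evaluation metrics copied from the list above.
rewards_chosen = -9.3770
rewards_rejected = -40.9760
reported_margin = 31.5990

# ASSUMPTION: margin = chosen reward - rejected reward (usual DPO logging convention).
margin = rewards_chosen - rewards_rejected

# The recomputed margin should agree with the reported one up to rounding.
matches = abs(margin - reported_margin) < 1e-4
print(f"recomputed margin: {margin:.4f}, matches reported value: {matches}")
```

A Rewards/accuracies value of 0.9961 then means the chosen response received the higher reward in about 99.6% of evaluation pairs.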

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 4
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1
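
To illustrate what lr_scheduler_type: linear implies, here is a minimal sketch of a linear schedule that decays from the base learning rate to zero. The function name linear_lr, the warmup_steps parameter (set to 0 because no warmup is listed above), and the 1000-step run are hypothetical and chosen only for illustration.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly so the rate reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical 1000-step run: start, midpoint, and end of training.
for step in (0, 500, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

Under this schedule the rate is at its maximum of 5e-05 at the first step, half of that midway through, and zero at the final step.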

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.0864	0.0719	1000	0.1031	-24.3451	-55.7071	0.9919	31.3620	-667.2187	-328.0001	-1.3920	-1.9101
0.0721	0.1438	2000	0.0666	-17.1956	-43.6489	0.9932	26.4533	-546.6367	-256.5046	-1.3146	-1.8819
0.0513	0.2157	3000	0.0586	-13.4148	-39.7394	0.9932	26.3247	-507.5419	-218.6962	-1.5754	-2.0549
0.0391	0.2876	4000	0.0518	-11.9859	-42.5627	0.9942	30.5768	-535.7746	-204.4073	-1.5376	-2.0293
0.0431	0.3595	5000	0.0584	-15.0281	-51.9022	0.9945	36.8741	-629.1698	-234.8300	-1.5020	-2.0037
0.0386	0.4313	6000	0.0399	-10.5384	-39.9545	0.9961	29.4161	-509.6927	-189.9328	-1.5356	-2.0315
0.0417	0.5032	7000	0.0452	-11.8813	-46.2602	0.9955	34.3789	-572.7493	-203.3616	-1.4399	-1.9551
0.06	0.5751	8000	0.0387	-9.4865	-39.5614	0.9958	30.0749	-505.7617	-179.4136	-1.5289	-2.0209
0.0478	0.6470	9000	0.0376	-9.9444	-40.6988	0.9961	30.7544	-517.1356	-183.9923	-1.5154	-2.0106
0.022	0.7189	10000	0.0399	-9.6813	-41.9896	0.9961	32.3084	-530.0439	-181.3615	-1.4896	-1.9912
0.0254	0.7908	11000	0.0378	-9.1448	-40.6698	0.9961	31.5250	-516.8457	-175.9964	-1.5031	-2.0023
0.0357	0.8627	12000	0.0387	-9.6321	-41.6962	0.9961	32.0641	-527.1096	-180.8692	-1.4851	-1.9878
0.0626	0.9346	13000	0.0381	-9.3770	-40.9760	0.9961	31.5990	-519.9075	-178.3189	-1.4901	-1.9907
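
The Epoch and Step columns also allow a rough estimate of the training set size. The sketch below assumes the effective batch size equals the listed train_batch_size of 4 (a single device and no gradient accumulation, neither of which the card confirms), and the epoch values are rounded to four decimals, so the result is approximate.

```python
# (epoch fraction, step) pairs copied from the first and last table rows.
checkpoints = [(0.0719, 1000), (0.9346, 13000)]

# Steps per full epoch implied by each row.
steps_per_epoch = [step / epoch for epoch, step in checkpoints]

# ASSUMPTION: effective batch size == train_batch_size (one device, no accumulation).
effective_batch_size = 4
approx_pairs = [s * effective_batch_size for s in steps_per_epoch]

for s, p in zip(steps_per_epoch, approx_pairs):
    print(f"~{s:.0f} steps per epoch, ~{p:.0f} preference pairs")
```

Both rows point to roughly 13,900 optimizer steps per epoch, or on the order of 55,600 training pairs under the stated assumption.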

Framework versions

PEFT 0.10.0
Transformers 4.44.0
Pytorch 2.3.0+cu121
Datasets 2.14.7
Tokenizers 0.19.1
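
Because PEFT appears in the framework versions, the repository most likely hosts an adapter rather than full model weights; the card does not state this, so treat it as an assumption. Under that assumption, a minimal loading sketch could look like the following. The helper name load_adapter is hypothetical, and the base model is gated, so access to it must already be granted.

```python
BASE_MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_ID = "NanQiangHF/llama3.1_8b_dpo_bwgenerator_test"


def load_adapter():
    """Load the tokenizer and the adapter-wrapped model (downloads weights)."""
    # Imported lazily so this file can be read without the libraries installed.
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # ASSUMPTION: the repository contains a PEFT adapter config pointing at the base model.
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    model = AutoPeftModelForCausalLM.from_pretrained(ADAPTER_ID)
    return tokenizer, model
```

Calling load_adapter() downloads the 8B base model as well as the adapter, so it needs substantial disk space and memory.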
