qgallouedec (HF staff) committed on
Commit 4981722 · verified · 1 Parent(s): 9a5e8f8

End of training

Files changed (3):
  1. README.md +3 -3
  2. model.safetensors +1 -1
  3. training_args.bin +2 -2
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 base_model: Qwen/Qwen2-0.5B-Instruct
-datasets: trl-lib/tldr-preference
+datasets: trl-lib/Capybara-Preferences
 library_name: transformers
 model_name: dpo-qwen2
 tags:
@@ -12,7 +12,7 @@ licence: license
 
 # Model Card for dpo-qwen2
 
-This model is a fine-tuned version of [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [trl-lib/tldr-preference](https://huggingface.co/datasets/trl-lib/tldr-preference) dataset.
+This model is a fine-tuned version of [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [trl-lib/Capybara-Preferences](https://huggingface.co/datasets/trl-lib/Capybara-Preferences) dataset.
 It has been trained using [TRL](https://github.com/huggingface/trl).
 
 ## Quick start
@@ -28,7 +28,7 @@ print(output["generated_text"])
 
 ## Training procedure
 
-[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/huggingface/huggingface/runs/90tpt217)
+[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/huggingface/trl/runs/a8jlsgpf)
 
 This model was trained with DPO, a method introduced in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290).
 
 
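The card's Quick start section, whose closing `print(output["generated_text"])` line appears as diff context above, wraps the model in a `transformers` text-generation pipeline. A minimal sketch of that usage, assuming the repo id `qgallouedec/dpo-qwen2` and an illustrative prompt rather than the card's exact values:

```python
from transformers import pipeline

# Assumed repo id for this model; adjust to the actual Hub location.
generator = pipeline("text-generation", model="qgallouedec/dpo-qwen2", device_map="auto")

# Chat-formatted input, since the base model is an Instruct checkpoint.
messages = [{"role": "user", "content": "Summarize what DPO training does in one sentence."}]
output = generator(messages, max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```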
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6f8b51d6c2df7d85ed2765fe72fcd694225bac7ab9010c018d78ed2256ddf8bc
+oid sha256:003da1c30dc253a18402911f38deeba08e9d9b38f1824ec8027f20f8ce7a5db3
 size 1976163472
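Both binary files are stored through Git LFS, so the diff only touches the pointer file: `oid sha256:...` is the SHA-256 of the real file contents and `size` is its byte length. A minimal sketch of checking a downloaded `model.safetensors` against the new pointer, with the local path as an assumption:

```python
import hashlib
import os

path = "model.safetensors"  # assumed local download location
expected_oid = "003da1c30dc253a18402911f38deeba08e9d9b38f1824ec8027f20f8ce7a5db3"
expected_size = 1976163472

# Hash the file in chunks to avoid loading ~2 GB into memory at once.
sha256 = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

assert os.path.getsize(path) == expected_size, "size mismatch"
assert sha256.hexdigest() == expected_oid, "sha256 mismatch"
print("model.safetensors matches the LFS pointer")
```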
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1e5edccc131463e114a8ce8df75b51279e7be0a3f4b81fec569e5a80b4b87116
-size 6008
+oid sha256:79df2cd828b1674e03fbff845d93115a2a3f2a98a86d40fc4a3a5383de1f4bb2
+size 5944
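`training_args.bin` is the training configuration that the `transformers` Trainer serializes with `torch.save` at the end of a run; the changed hash and size simply reflect a different serialized configuration. A rough sketch of the kind of TRL DPO script that would produce these artifacts, with hyperparameters and the output directory as assumptions rather than the values actually stored in the file (older TRL releases pass `tokenizer=` instead of `processing_class=`):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs (chosen/rejected) from the dataset referenced by the updated card.
train_dataset = load_dataset("trl-lib/Capybara-Preferences", split="train")

# Assumed hyperparameters; the real values live in training_args.bin.
training_args = DPOConfig(output_dir="dpo-qwen2", logging_steps=10)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()  # writes model.safetensors and training_args.bin to output_dir
```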