SunnyLin committed on
Commit 7cca813 · verified · 1 Parent(s): 1a72fae

Update README.md

Files changed (1)
  1. README.md +42 -1
README.md CHANGED
      value: 74.8
---

## Introduction

More information at [DPO-VP](https://github.com/TU2021/DPO-VP).

Drawing on ideas from Iterative DPO, we propose a self-improvement process built on the Qwen2.5-Math-7B base model: in each round, we sample candidate solutions and filter them (sampling-filtering) to construct preference datasets from a challenging 8K MATH dataset.
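As a rough sketch of what one sampling-filtering round can look like (a hypothetical illustration, not the actual DPO-VP code; `sample_solutions` and `is_correct` stand in for the repo's generation and answer-checking logic):

```python
import random

def build_preference_pairs(problems, sample_solutions, is_correct, n_samples=8):
    """One sampling-filtering round: sample several candidate solutions per
    problem, split them by answer correctness, and pair a correct solution
    (chosen) with an incorrect one (rejected) as a DPO preference example."""
    pairs = []
    for problem in problems:
        candidates = sample_solutions(problem, n=n_samples)
        correct = [c for c in candidates if is_correct(problem, c)]
        wrong = [c for c in candidates if not is_correct(problem, c)]
        if correct and wrong:  # skip problems without a usable contrast
            pairs.append({
                "prompt": problem,
                "chosen": random.choice(correct),
                "rejected": random.choice(wrong),
            })
    return pairs
```

Each round's pairs can then drive a DPO update before sampling again from the improved model, in the spirit of Iterative DPO.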
 
[…]

| **[Qwen2.5-7B-PURE-VR](https://huggingface.co/jinachris/PURE-VR)** * | 79.8 | 36.8 | 41.9 | 60.0 | 20.0 | 47.7 |
| **Qwen2.5-7B-DPO-VP** | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 |

In the table, all models are fine-tuned from the Qwen2.5-Math-7B base model. Bolded models are those tuned with the self-improvement method using exactly the same prompts. Results marked with * come from our own evaluation; results marked with ^ are taken from the corresponding model's technical report. Note that Qwen2.5-7B-Simple-RL-Zero has not released its trained model, so we evaluated a reproduction found on Hugging Face. We also observed that, because Qwen's official evaluation code shards the model across devices, scores can differ slightly when evaluating on different numbers of GPUs; our model and the reproduced results were both evaluated on 4 A800 GPUs.

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B-Instruct"
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Find the value of $x$ that satisfies the equation $4x+5 = 6x+7$."

messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
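
Since the system prompt asks the model to put its final answer inside `\boxed{}`, a small helper like the one below (our own sketch, not part of this repo; it assumes the boxed answer contains no nested braces) can pull the answer out of `response`:

```python
import re

def extract_boxed_answer(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the model output."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed_answer(response))  # the sample equation gives x = -1
```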