Update README.md
README.md
CHANGED
      value: 74.8
---

## Introduction

More information at [DPO-VP](https://github.com/TU2021/DPO-VP).

Drawing on the ideas of Iterative DPO, we propose a self-improvement process built on the Qwen2.5-Math-7B base model. In this process, we perform sampling-filtering to construct preference datasets for self-improvement from a challenging 8K MATH dataset.
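
The sampling-filtering step is described only at a high level here. Below is a minimal sketch of how such preference pairs could be built, assuming a hypothetical `sample_solutions` function (drawing several candidate solutions from the current model) and a hypothetical `is_correct` checker (comparing a candidate's final answer against the reference); neither is part of the released DPO-VP code.

```python
import random
from typing import Callable

def build_preference_pairs(
    problems: list[dict],                               # each item: {"question": str, "answer": str}
    sample_solutions: Callable[[str, int], list[str]],  # placeholder: sample N solutions from the current model
    is_correct: Callable[[str, str], bool],             # placeholder: check a solution against the reference answer
    n_samples: int = 8,
) -> list[dict]:
    """Sampling-filtering sketch: for each problem, sample several solutions from the
    current model, then pair one correct solution (chosen) with one incorrect solution
    (rejected) to form a DPO preference example."""
    pairs = []
    for item in problems:
        candidates = sample_solutions(item["question"], n_samples)
        correct = [c for c in candidates if is_correct(c, item["answer"])]
        wrong = [c for c in candidates if not is_correct(c, item["answer"])]
        if correct and wrong:  # keep only problems that yield both a chosen and a rejected solution
            pairs.append({
                "prompt": item["question"],
                "chosen": random.choice(correct),
                "rejected": random.choice(wrong),
            })
    return pairs
```

In an Iterative-DPO-style loop, these pairs would drive one round of DPO training, after which the improved model is sampled again to build the next round's dataset.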

The final model achieved an average score of 48.2 on five mathematical reasoning benchmarks:

| **[Qwen2.5-7B-PURE-VR](https://huggingface.co/jinachris/PURE-VR)** * | 79.8 | 36.8 | 41.9 | 60.0 | 20.0 | 47.7 |
| **Qwen2.5-7B-DPO-VP** | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 |

In the table, all models are fine-tuned from the Qwen2.5-Math-7B base model. Bolded models are those adjusted with the self-improvement method using exactly the same prompts. Results marked with * come from our own evaluation, and results marked with ^ are taken from the corresponding model's technical report. Note that Qwen2.5-7B-Simple-RL-Zero has not released its trained model, so we evaluated a reproduced version found on Hugging Face. Additionally, we observed that because Qwen's official evaluation code slices the model across GPUs, results may differ slightly when evaluating on different numbers of GPUs. Our model and the reproduced results were both evaluated on 4 A800 GPUs.

## Quick Start
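
The snippet below follows the standard `transformers` chat-template usage for Qwen2.5-Math models. As published it loads `Qwen/Qwen2.5-Math-7B-Instruct`; substitute this repository's checkpoint ID in `model_name` to query DPO-VP itself.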

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B-Instruct"
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Find the value of $x$ that satisfies the equation $4x+5 = 6x+7$."

# Qwen2.5-Math expects a system prompt asking for step-by-step reasoning
# with the final answer wrapped in \boxed{}.
messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt}
]

# Render the chat template as text, then tokenize.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated answer is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
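
The decoded `response` holds the model's step-by-step solution; given the system prompt, the final answer should appear inside `\boxed{}` (for this example, $4x+5=6x+7$ simplifies to $2x=-2$, so $x=-1$). Print it or parse out the boxed expression as needed.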