KAKA22 committed on
Commit b91222a · verified · 1 Parent(s): 23fa8d6

Update README.md

Files changed (1):
  1. README.md +68 -4

README.md CHANGED
@@ -12,11 +12,11 @@ tags:
  - llama
  ---

- # Model Description

  CodeRM-8B is a small yet powerful model designed to enable efficient and high-quality unit test generation.
- It is trained based on **Llama3.1-8B-Instruct** on a dataset of 60k high-quality synthetic Python unit tests.
- These unit tests are derived from two well-regarded code instruction tuning datasets:
  [CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) and the
  training set of [TACO](https://huggingface.co/datasets/BAAI/TACO).
  The training dataset used for unit test generation is openly available under
@@ -25,6 +25,65 @@ The training dataset used for unit test generation is openly available under
  For further information and details of training, refer to our paper:
  "Dynamic Scaling of Unit Tests for Code Reward Modeling" available on arXiv.
  # Prompt Format

  ```
@@ -40,6 +99,11 @@ Please add detailed comments to the test cases you write. You do not need to tes
  ```

  # Citation
  If you find our model helpful, please cite the original paper:
  ```
- ```
 
 
 
 
 
  - llama
  ---

+ # Introduction

  CodeRM-8B is a small yet powerful model designed to enable efficient and high-quality unit test generation.
+ It is trained on a dataset of 60k high-quality synthetic Python unit tests generated by Llama3.1-70B-Instruct.
+ These unit tests are synthesized based on two well-regarded code instruction tuning datasets:
  [CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) and the
  training set of [TACO](https://huggingface.co/datasets/BAAI/TACO).
  The training dataset used for unit test generation is openly available under

  For further information and details of training, refer to our paper:
  "Dynamic Scaling of Unit Tests for Code Reward Modeling" available on arXiv.
 
+ # Model Information
+
+ The model is fine-tuned from [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
+
+ # Performance
+
+ ## Best-of-N
+
+ First, we evaluate the performance of CodeRM-8B in a best-of-N setting. In this setup, an LLM (the policy model) generates
+ 100 candidate code solutions for a given programming problem, while another LLM (the reward model) generates 100 unit
+ tests. The optimal code solution is then selected by majority voting over the execution results of these unit tests.
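The selection step can be illustrated with a short, self-contained sketch. This is not the released evaluation code: the `run_test` executor and the in-process `exec` calls below are hypothetical stand-ins, and the sketch simply returns the candidate that passes the most generated unit tests, which is one direct reading of the majority-vote selection described above.

```python
# Minimal best-of-N selection sketch (not the official evaluation harness).
# `candidates` are the N sampled solutions and `unit_tests` the generated tests,
# both as Python source strings; a real setup would sandbox execution and
# enforce per-test timeouts.
from typing import List


def run_test(solution_code: str, test_code: str) -> bool:
    """Hypothetical executor: True if the test passes against the candidate solution."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate's functions
        exec(test_code, namespace)      # test assertions raise on failure
        return True
    except Exception:
        return False


def select_best_candidate(candidates: List[str], unit_tests: List[str]) -> str:
    # Each unit test "votes" for the candidates it accepts; the candidate that
    # passes the most tests is returned as the best-of-N solution.
    scores = [sum(run_test(c, t) for t in unit_tests) for c in candidates]
    return candidates[scores.index(max(scores))]
```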
+
+ Under this framework, our trained unit test generator demonstrates performance comparable to Llama3.1-70B-Instruct,
+ despite having roughly 8x fewer parameters. The detailed evaluation results across three well-known benchmarks
+ are as follows:
+
+ | Model | Policy: Llama3-8B | Policy: Llama3-70B | Policy: GPT-3.5 | Policy: GPT-4o-mini |
+ | :------ | :------ | :------ | :------ | :------ |
+ | **Benchmark: HumanEval Plus** |||||
+ | Vanilla | 53.58 | 73.74 | 67.83 | 82.96 |
+ | Reward: Llama3.1-8B | 66.84 (+13.26) | 77.14 (+3.40) | 76.32 (+8.49) | 83.11 (+0.15) |
+ | Reward: Llama3.1-70B | **72.04 (+18.46)** | <u>78.54 (+4.80)</u> | **79.76 (+11.93)** | <u>85.45 (+2.49)</u> |
+ | Reward: CodeRM-8B | <u>72.01 (+18.43)</u> | **78.69 (+4.95)** | <u>78.01 (+10.18)</u> | **86.38 (+3.42)** |
+ | **Benchmark: MBPP Plus** |||||
+ | Vanilla | 49.20 | 69.33 | 70.53 | 71.59 |
+ | Reward: Llama3.1-8B | 64.31 (+15.11) | 71.64 (+2.31) | 74.18 (+3.65) | 74.48 (+2.89) |
+ | Reward: Llama3.1-70B | <u>65.26 (+16.06)</u> | <u>71.85 (+2.52)</u> | <u>75.72 (+5.19)</u> | <u>74.96 (+3.37)</u> |
+ | Reward: CodeRM-8B | **66.71 (+17.51)** | **72.44 (+3.11)** | **75.96 (+5.43)** | **75.20 (+3.61)** |
+ | **Benchmark: LiveCodeBench** |||||
+ | Vanilla | 11.98 | 25.30 | 20.55 | 34.83 |
+ | Reward: Llama3.1-70B | <u>13.28 (+1.30)</u> | **28.46 (+3.16)** | **22.80 (+2.25)** | <u>38.60 (+3.77)</u> |
+ | Reward: CodeRM-8B | **15.21 (+3.23)** | <u>27.73 (+2.43)</u> | <u>21.76 (+1.21)</u> | **39.20 (+4.37)** |
+
+ ## Quality of Unit Tests
+
+ We also evaluate the quality of the unit tests generated by CodeRM-8B. Since each unit test functions as a classifier
+ that accepts or rejects a candidate solution, we first use accuracy and F1 score to assess its classification
+ performance.
+
+ We further propose two new metrics that evaluate, in more detail, how often unit tests make incorrect judgments.
+ False Acceptance Rate (FAR) measures the probability of wrong solutions being accepted by unit tests.
+ False Rejection Rate (FRR) measures the probability of correct solutions being rejected by unit tests.
+ The calculation formulas for these four metrics are introduced in Appendix D of the paper.
+
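As a rough illustration of these definitions (not the exact formulas from Appendix D of the paper), the four metrics can be computed from labeled outcomes, where each candidate solution carries a ground-truth correctness label and the verdict of the generated unit tests. The helper below is a hypothetical sketch under that reading: FAR is the share of wrong solutions that are accepted, and FRR the share of correct solutions that are rejected.

```python
# Hypothetical sketch of Acc, F1, FAR, and FRR from labeled outcomes.
# Each record pairs a ground-truth label (is the solution actually correct?)
# with the unit tests' verdict (was the solution accepted, i.e. did it pass?).
from typing import List, Tuple


def unit_test_metrics(records: List[Tuple[bool, bool]]) -> dict:
    """records: (is_correct, is_accepted) pairs for candidate solutions."""
    tp = sum(c and a for c, a in records)        # correct solutions accepted
    fp = sum((not c) and a for c, a in records)  # wrong solutions accepted (false acceptance)
    fn = sum(c and (not a) for c, a in records)  # correct solutions rejected (false rejection)
    tn = sum((not c) and (not a) for c, a in records)

    accuracy = (tp + tn) / len(records)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0  # wrong solutions that slip through
    frr = fn / (fn + tp) if fn + tp else 0.0  # correct solutions that are rejected
    return {"Acc": accuracy, "F1": f1, "FAR": far, "FRR": frr}
```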
+ Below we report the quality of individual unit tests and of combinations of multiple unit tests on HumanEval Plus,
+ using Llama3.1-8B as the policy model. The top two results are marked in **bold** and <u>underlined</u>.
+
+ | **Model** | **Acc (↑)** | **F1 (↑)** | **FAR (↓)** | **FRR (↓)** |
+ |----------------------|---------------|---------------|---------------|---------------|
+ | **Quality of Individual Unit Tests** | | | | |
+ | Llama3.1-8B | 60.02 | 44.97 | 13.66 | 46.13 |
+ | Llama3.1-70B | **73.65** | **70.15** | **11.10** | **34.51** |
+ | CodeRM-8B (Ours) | <u>69.64</u> | <u>63.63</u> | <u>11.17</u> | <u>38.55</u> |
+ | **Quality of Multiple Unit Tests** | | | | |
+ | Llama3.1-8B | 74.21 | 74.35 | 20.44 | 30.55 |
+ | Llama3.1-70B | <u>78.30</u> | <u>78.76</u> | <u>17.19</u> | <u>25.97</u> |
+ | CodeRM-8B (Ours) | **80.46** | **81.27** | **16.48** | **22.71** |
+
  # Prompt Format

  ```

  ```
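For completeness, here is one way to query the model with the prompt format shown above. This is a generic `transformers` text-generation sketch rather than an official snippet: the repository id, sampling settings, and the placeholder prompt string are assumptions, and the prompt should be filled in following the format documented above.

```python
# Hypothetical usage sketch with Hugging Face transformers; the model id and
# generation settings are assumptions, and `prompt` should follow the prompt
# format documented above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KAKA22/CodeRM-8B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "..."  # fill in the unit test generation prompt following the format above
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```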
 
  # Citation
+
  If you find our model helpful, please cite the original paper:
  ```
+ ```
+
+ # Contact
+
+ If you have any problems, feel free to raise an issue or reach out to us via email at: <[email protected]>.