Update pipeline tag and add library name to model card

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +62 -52
README.md CHANGED
@@ -1,9 +1,11 @@
  ---
- license: llama3.1
  base_model:
- - meta-llama/Llama-3.1-8B-Instruct
- pipeline_tag: text-classification
  ---
  # Skywork-Reward-V2

  <div align="center">
@@ -26,22 +28,22 @@ pipeline_tag: text-classification

  **Skywork-Reward-V2** is a series of eight reward models designed for versatility across a wide range of tasks, trained on a mixture of 26 million carefully curated preference pairs. While the Skywork-Reward-V2 series remains based on the Bradley-Terry model, we push the boundaries of training data scale and quality to achieve superior performance. Compared with the first generation of Skywork-Reward, the Skywork-Reward-V2 series offers the following major improvements:

- - **Trained on a significantly larger and higher-quality preference data mixture**, consisting of **26 million preference pairs** curated via a large-scale human-LLM synergistic pipeline.
- - **State-of-the-art performance on seven major reward model benchmarks**, including RewardBench v1, RewardBench v2, PPE Preference, PPE Correctness, RMB, RM-Bench, and JudgeBench.
- - **Available in eight models across multiple sizes**, with the smallest 0.6B variant, *Skywork-Reward-V2-Qwen3-0.6B*, nearly matching the average performance of our previous best model, Skywork-Reward-Gemma-2-27B-v0.2. The largest 8B version, *Skywork-Reward-V2-Llama-3.1-8B*, surpasses all existing reward models across all benchmarks on average. Our top experimental model, *Skywork-Reward-V2-Llama-3.1-8B-40M*, **outperforms all existing reward models on every benchmark**.

  <div align="center">

- | Model | Base Model | Link |
  |:-----------------------------------|:--------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|
- | Skywork-Reward-V2-Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B) |
  | Skywork-Reward-V2-Llama-3.1-8B-40M | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B-40M) |
- | Skywork-Reward-V2-Llama-3.2-1B | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.2-1B) |
- | Skywork-Reward-V2-Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.2-3B) |
- | Skywork-Reward-V2-Qwen3-0.6B | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-0.6B) |
- | Skywork-Reward-V2-Qwen3-1.7B | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-1.7B) |
- | Skywork-Reward-V2-Qwen3-4B | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-4B) |
- | Skywork-Reward-V2-Qwen3-8B | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-8B) |

  </div>

@@ -51,43 +53,43 @@ For the complete collection of models, please refer to the [Skywork-Reward-V2](h

  In the following table, we categorize the models into two types: Bradley-Terry (BT) reward models and Generative reward models. The Skywork-Reward-V2 series outperforms models in both categories with much smaller model sizes.

- | Category | Model | RewardBench v1 | RewardBench v2 | PPE Preference | PPE Correctness | RMB | RM-Bench | JudgeBench | Avg. |
  |:-----------------:|:---------------------------------------|:--------------:|:--------------:|:--------------:|:---------------:|:--------:|:--------:|:----------:|:--------:|
- | **Bradley-Terry** | Llama-3-OffsetBias-RM-8B | 89.0 | 64.8 | 59.2 | 64.1 | 57.8 | 71.3 | 63.5 | 67.1 |
- | | ArmoRM-Llama3-8B-v0.1 | 90.4 | 66.5 | 60.6 | 60.6 | 64.6 | 69.3 | 59.7 | 67.4 |
- | | Internlm2-20b-reward | 90.2 | 56.3 | 61.0 | 63.0 | 62.9 | 68.3 | 64.3 | 66.6 |
- | | Skywork-Reward-Llama-3.1-8B-v0.2 | 93.1 | 71.8 | 62.2 | 62.5 | 66.6 | 72.1 | 62.9 | 70.2 |
- | | LDL-Reward-Gemma-2-27B-v0.1 | 95.0 | 72.5 | 62.4 | 63.9 | 67.9 | 71.1 | 64.2 | 71.0 |
- | | Skywork-Reward-Gemma-2-27B-v0.2 | 94.3 | 75.3 | 63.6 | 61.9 | 69.4 | 70.0 | 66.5 | 71.6 |
- | | INF-ORM-Llama3.1-70B | 95.1 | 76.5 | 64.2 | 64.4 | 70.5 | 73.8 | 70.2 | 73.5 |
- | **Generative** | GPT-4o | 86.7 | 64.9 | 67.7 | - | 73.8 | - | 59.8 | - |
- | | Claude-3.5-Sonnet | 84.2 | 64.7 | 67.3 | - | 70.6 | - | 64.8 | - |
- | | DeepSeek-GRM-27B | 88.5 | - | 65.3 | 60.4 | 69.0 | - | - | - |
- | | DeepSeek-GRM-27B (w/ MetaRM) | 90.4 | - | 67.2 | 63.2 | 70.3 | - | - | - |
- | | RM-R1-Qwen-Instruct-32B | 92.9 | - | - | - | 73.0 | 79.1 | - | - |
- | | RM-R1-DeepSeek-Distill-Qwen-32B | 90.9 | - | - | - | 69.8 | 83.9 | - | - |
- | | EvalPlanner (Llama-3.1-70B) | 93.9 | - | - | - | - | 80.0 | 50.9 | - |
- | | EvalPlanner (Llama-3.3-70B) | 93.8 | - | - | - | - | 82.1 | 56.6 | - |
- | | J1-Llama-8B | 85.7 | - | 60.3 | 59.2 | - | 73.4 | 42.0 | - |
- | | J1-Llama-8B (Maj@32) | - | - | 60.6 | 61.9 | - | - | - | - |
- | | J1-Llama-70B | 93.3 | - | 66.3 | 72.9 | - | 82.7 | 60.0 | - |
- | | J1-Llama-70B (Maj@32) | - | - | 67.0 | 73.7 | - | - | - | - |
- | **Bradley-Terry** | **Skywork-Reward-V2-Qwen3-0.6B** | 85.2 | 61.3 | 65.3 | 68.3 | 74.5 | 74.4 | 67.6 | 70.9 |
- | | **Skywork-Reward-V2-Qwen3-1.7B** | 90.3 | 68.3 | 67.6 | 70.5 | 78.1 | 78.7 | 72.9 | 75.2 |
- | | **Skywork-Reward-V2-Qwen3-4B** | 93.4 | 75.5 | 69.5 | 74.7 | 80.6 | 81.6 | 69.3 | 77.8 |
- | | **Skywork-Reward-V2-Qwen3-8B** | 93.7 | 78.2 | 70.6 | 75.1 | 81.2 | 82.6 | 73.4 | 79.3 |
- | | **Skywork-Reward-V2-Llama-3.2-1B** | 89.9 | 64.3 | 66.6 | 67.4 | 76.7 | 76.4 | 65.0 | 72.3 |
- | | **Skywork-Reward-V2-Llama-3.2-3B** | 93.0 | 74.7 | 69.1 | 72.1 | 80.5 | 81.1 | 69.2 | 77.1 |
- | | **Skywork-Reward-V2-Llama-3.1-8B** | 96.4 | 84.1 | 77.3 | 83.4 | 86.4 | 92.8 | 80.0 | 85.8 |
- | | **Skywork-Reward-V2-Llama-3.1-8B-40M** | **97.8** | **86.5** | **79.8** | **87.2** | **89.3** | **96.0** | **83.4** | **88.6** |

  ## 💡 Recommended Usage

  We make the following recommendations for using the Skywork-Reward-V2 model series:

- 1. For most use cases, we recommend Skywork-Reward-V2-Llama-3.1-8B and consider smaller variants for low-resource settings.
- 2. All models are trained on preference data with a maximum length of 16,384 tokens. It is recommended to perform inference within this limit.
- 3. Do not include system prompts when using chat templates.

  Special note on Skywork-Reward-V2-Llama-3.1-8B-40M:

@@ -118,8 +120,12 @@ rm = AutoModelForSequenceClassification.from_pretrained(
  tokenizer = AutoTokenizer.from_pretrained(model_name)

  prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
- response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
- response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

  conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
  conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]
@@ -212,8 +218,12 @@ def process_convs(convs, base_url, tokenizer, model_name_or_path):

  prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
- response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
- response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

  conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
  conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]
@@ -231,9 +241,9 @@ print(f"Score for response 2: {rewards[1]}")

  This model repository, including the model weights and code, is licensed under the [Llama 3.1 community license](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/LICENSE). Reward models in the Skywork-Reward-V2 series derived from Qwen3 support commercial use and permit modifications and the creation of derivative works, provided that all conditions of the Apache 2.0 License are met and proper attribution is given. Please note that:

- - Skywork-Reward-V2-Qwen3-0.6B, Skywork-Reward-V2-Qwen3-1.7B, Skywork-Reward-V2-Qwen3-4B, and Skywork-Reward-V2-Qwen3-8B are derived from the Qwen3 model series of corresponding sizes, which are originally licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
- - Skywork-Reward-V2-Llama-3.1-8B and Skywork-Reward-V2-Llama-3.1-8B-40M are both derived from Llama-3.1-8B-Instruct and follow the [Llama 3.1 community license](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/LICENSE).
- - Skywork-Reward-V2-Llama-3.2-1B and Skywork-Reward-V2-Llama-3.2-3B are derived from Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct, respectively, and follow the [Llama 3.2 community license](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/LICENSE.txt).

  ## 📧 Contact

 
  ---
  base_model:
+ - meta-llama/Llama-3.1-8B-Instruct
+ license: llama3.1
+ pipeline_tag: text-ranking
+ library_name: transformers
  ---
+
  # Skywork-Reward-V2

  <div align="center">


  **Skywork-Reward-V2** is a series of eight reward models designed for versatility across a wide range of tasks, trained on a mixture of 26 million carefully curated preference pairs. While the Skywork-Reward-V2 series remains based on the Bradley-Terry model, we push the boundaries of training data scale and quality to achieve superior performance. Compared with the first generation of Skywork-Reward, the Skywork-Reward-V2 series offers the following major improvements:

+ - **Trained on a significantly larger and higher-quality preference data mixture**, consisting of **26 million preference pairs** curated via a large-scale human-LLM synergistic pipeline.
+ - **State-of-the-art performance on seven major reward model benchmarks**, including RewardBench v1, RewardBench v2, PPE Preference, PPE Correctness, RMB, RM-Bench, and JudgeBench.
+ - **Available in eight models across multiple sizes**, with the smallest 0.6B variant, *Skywork-Reward-V2-Qwen3-0.6B*, nearly matching the average performance of our previous best model, Skywork-Reward-Gemma-2-27B-v0.2. The largest 8B version, *Skywork-Reward-V2-Llama-3.1-8B*, surpasses all existing reward models across all benchmarks on average. Our top experimental model, *Skywork-Reward-V2-Llama-3.1-8B-40M*, **outperforms all existing reward models on every benchmark**.

  <div align="center">

+ | Model | Base Model | Link |
  |:-----------------------------------|:--------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|
+ | Skywork-Reward-V2-Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B) |
  | Skywork-Reward-V2-Llama-3.1-8B-40M | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B-40M) |
+ | Skywork-Reward-V2-Llama-3.2-1B | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.2-1B) |
+ | Skywork-Reward-V2-Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.2-3B) |
+ | Skywork-Reward-V2-Qwen3-0.6B | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-0.6B) |
+ | Skywork-Reward-V2-Qwen3-1.7B | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-1.7B) |
+ | Skywork-Reward-V2-Qwen3-4B | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-4B) |
+ | Skywork-Reward-V2-Qwen3-8B | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-8B) |

  </div>

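As background on the Bradley-Terry formulation mentioned in the card: a BT reward model assigns each response a single scalar score, and the modeled probability that one response is preferred over another is the sigmoid of their score difference; training minimizes the negative log-likelihood of the observed human preference. A minimal illustrative sketch in plain Python, not the actual training code used for these models:

```python
import math

def bt_preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the 'chosen' response is preferred,
    given the scalar rewards assigned by the model."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference (the standard BT training objective)."""
    return -math.log(bt_preference_probability(reward_chosen, reward_rejected))

# Illustrative rewards only: a chosen response scored 2.3 and a rejected one scored -0.7.
print(bt_preference_probability(2.3, -0.7))  # ~0.95
print(bt_loss(2.3, -0.7))                    # ~0.05
```
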

  In the following table, we categorize the models into two types: Bradley-Terry (BT) reward models and Generative reward models. The Skywork-Reward-V2 series outperforms models in both categories with much smaller model sizes.

+ | Category | Model | RewardBench v1 | RewardBench v2 | PPE Preference | PPE Correctness | RMB | RM-Bench | JudgeBench | Avg. |
  |:-----------------:|:---------------------------------------|:--------------:|:--------------:|:--------------:|:---------------:|:--------:|:--------:|:----------:|:--------:|
+ | **Bradley-Terry** | Llama-3-OffsetBias-RM-8B | 89.0 | 64.8 | 59.2 | 64.1 | 57.8 | 71.3 | 63.5 | 67.1 |
+ | | ArmoRM-Llama3-8B-v0.1 | 90.4 | 66.5 | 60.6 | 60.6 | 64.6 | 69.3 | 59.7 | 67.4 |
+ | | Internlm2-20b-reward | 90.2 | 56.3 | 61.0 | 63.0 | 62.9 | 68.3 | 64.3 | 66.6 |
+ | | Skywork-Reward-Llama-3.1-8B-v0.2 | 93.1 | 71.8 | 62.2 | 62.5 | 66.6 | 72.1 | 62.9 | 70.2 |
+ | | LDL-Reward-Gemma-2-27B-v0.1 | 95.0 | 72.5 | 62.4 | 63.9 | 67.9 | 71.1 | 64.2 | 71.0 |
+ | | Skywork-Reward-Gemma-2-27B-v0.2 | 94.3 | 75.3 | 63.6 | 61.9 | 69.4 | 70.0 | 66.5 | 71.6 |
+ | | INF-ORM-Llama3.1-70B | 95.1 | 76.5 | 64.2 | 64.4 | 70.5 | 73.8 | 70.2 | 73.5 |
+ | **Generative** | GPT-4o | 86.7 | 64.9 | 67.7 | - | 73.8 | - | 59.8 | - |
+ | | Claude-3.5-Sonnet | 84.2 | 64.7 | 67.3 | - | 70.6 | - | 64.8 | - |
+ | | DeepSeek-GRM-27B | 88.5 | - | 65.3 | 60.4 | 69.0 | - | - | - |
+ | | DeepSeek-GRM-27B (w/ MetaRM) | 90.4 | - | 67.2 | 63.2 | 70.3 | - | - | - |
+ | | RM-R1-Qwen-Instruct-32B | 92.9 | - | - | - | 73.0 | 79.1 | - | - |
+ | | RM-R1-DeepSeek-Distill-Qwen-32B | 90.9 | - | - | - | 69.8 | 83.9 | - | - |
+ | | EvalPlanner (Llama-3.1-70B) | 93.9 | - | - | - | - | 80.0 | 50.9 | - |
+ | | EvalPlanner (Llama-3.3-70B) | 93.8 | - | - | - | - | 82.1 | 56.6 | - |
+ | | J1-Llama-8B | 85.7 | - | 60.3 | 59.2 | - | 73.4 | 42.0 | - |
+ | | J1-Llama-8B (Maj@32) | - | - | 60.6 | 61.9 | - | - | - | - |
+ | | J1-Llama-70B | 93.3 | - | 66.3 | 72.9 | - | 82.7 | 60.0 | - |
+ | | J1-Llama-70B (Maj@32) | - | - | 67.0 | 73.7 | - | - | - | - |
+ | **Bradley-Terry** | **Skywork-Reward-V2-Qwen3-0.6B** | 85.2 | 61.3 | 65.3 | 68.3 | 74.5 | 74.4 | 67.6 | 70.9 |
+ | | **Skywork-Reward-V2-Qwen3-1.7B** | 90.3 | 68.3 | 67.6 | 70.5 | 78.1 | 78.7 | 72.9 | 75.2 |
+ | | **Skywork-Reward-V2-Qwen3-4B** | 93.4 | 75.5 | 69.5 | 74.7 | 80.6 | 81.6 | 69.3 | 77.8 |
+ | | **Skywork-Reward-V2-Qwen3-8B** | 93.7 | 78.2 | 70.6 | 75.1 | 81.2 | 82.6 | 73.4 | 79.3 |
+ | | **Skywork-Reward-V2-Llama-3.2-1B** | 89.9 | 64.3 | 66.6 | 67.4 | 76.7 | 76.4 | 65.0 | 72.3 |
+ | | **Skywork-Reward-V2-Llama-3.2-3B** | 93.0 | 74.7 | 69.1 | 72.1 | 80.5 | 81.1 | 69.2 | 77.1 |
+ | | **Skywork-Reward-V2-Llama-3.1-8B** | 96.4 | 84.1 | 77.3 | 83.4 | 86.4 | 92.8 | 80.0 | 85.8 |
+ | | **Skywork-Reward-V2-Llama-3.1-8B-40M** | **97.8** | **86.5** | **79.8** | **87.2** | **89.3** | **96.0** | **83.4** | **88.6** |

  ## 💡 Recommended Usage

  We make the following recommendations for using the Skywork-Reward-V2 model series:

+ 1. For most use cases, we recommend Skywork-Reward-V2-Llama-3.1-8B and consider smaller variants for low-resource settings.
+ 2. All models are trained on preference data with a maximum length of 16,384 tokens. It is recommended to perform inference within this limit.
+ 3. Do not include system prompts when using chat templates.

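Recommendations 2 and 3 can be applied directly when formatting conversations: build the conversation without a system message and truncate to the 16,384-token training length. A minimal sketch using the standard `transformers` chat-template API with one of the checkpoints listed above; the example conversation is made up for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-V2-Llama-3.1-8B")

conv = [
    # No "system" message, per recommendation 3.
    {"role": "user", "content": "Summarize the plot of Hamlet in one sentence."},
    {"role": "assistant", "content": "A Danish prince feigns madness while plotting revenge for his father's murder."},
]

# Recommendation 2: the models were trained on sequences of up to 16,384 tokens,
# so truncate anything longer before scoring.
input_ids = tokenizer.apply_chat_template(
    conv,
    tokenize=True,
    truncation=True,
    max_length=16384,
    return_tensors="pt",
)
print(input_ids.shape)
```
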
  Special note on Skywork-Reward-V2-Llama-3.1-8B-40M:

  tokenizer = AutoTokenizer.from_pretrained(model_name)

  prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
+ response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
+ response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

  conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
  conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]

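For reference, the scoring step that follows the snippet above is typically the usual sequence-classification pattern: apply the chat template to each conversation, run a forward pass, and read the single logit as the reward. A sketch assuming `rm`, `tokenizer`, `conv1`, and `conv2` were created as in the surrounding snippet:

```python
import torch

def score(conv):
    # Format the conversation with the model's chat template and tokenize it.
    input_ids = tokenizer.apply_chat_template(conv, tokenize=True, return_tensors="pt").to(rm.device)
    with torch.no_grad():
        # The reward head has a single output logit, used directly as the scalar reward.
        return rm(input_ids).logits[0][0].item()

print(f"Score for response 1: {score(conv1)}")
print(f"Score for response 2: {score(conv2)}")
```
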

  prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
+ response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
+ response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

  conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
  conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]

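The server-based variant ends the same way: the two conversations are handed to `process_convs` (defined earlier in the README; its signature appears in the hunk header above), which returns one reward per conversation. A sketch of the call site, with a placeholder URL filled in for illustration:

```python
# Placeholder values for illustration; point these at your own deployment.
base_url = "http://localhost:8000"
model_name_or_path = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"

rewards = process_convs([conv1, conv2], base_url, tokenizer, model_name_or_path)
print(f"Score for response 1: {rewards[0]}")
print(f"Score for response 2: {rewards[1]}")
```
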

  This model repository, including the model weights and code, is licensed under the [Llama 3.1 community license](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/LICENSE). Reward models in the Skywork-Reward-V2 series derived from Qwen3 support commercial use and permit modifications and the creation of derivative works, provided that all conditions of the Apache 2.0 License are met and proper attribution is given. Please note that:

+ - Skywork-Reward-V2-Qwen3-0.6B, Skywork-Reward-V2-Qwen3-1.7B, Skywork-Reward-V2-Qwen3-4B, and Skywork-Reward-V2-Qwen3-8B are derived from the Qwen3 model series of corresponding sizes, which are originally licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
+ - Skywork-Reward-V2-Llama-3.1-8B and Skywork-Reward-V2-Llama-3.1-8B-40M are both derived from Llama-3.1-8B-Instruct and follow the [Llama 3.1 community license](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/LICENSE).
+ - Skywork-Reward-V2-Llama-3.2-1B and Skywork-Reward-V2-Llama-3.2-3B are derived from Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct, respectively, and follow the [Llama 3.2 community license](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/LICENSE.txt).

  ## 📧 Contact