lbourdois committed on
Commit d74135c · verified · 1 Parent(s): 53ecae3

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the `language` tag to improve discoverability. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +249 -237
README.md CHANGED
@@ -1,5 +1,17 @@
 ---
 license: apache-2.0
 language:
-- en
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
 base_model:

The rest of the file is unchanged; the resulting README follows.
 
 
 
 
 
 
 
 
 
 
 
 
 
---
license: apache-2.0
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
base_model:
- Qwen/Qwen2.5-14B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- CoT
- Convsersational
- text-generation-inference
model-index:
- name: QwQ-LCoT-14B-Conversational
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 40.47
      name: averaged accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT-14B-Conversational
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 45.63
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT-14B-Conversational
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 31.42
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT-14B-Conversational
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 13.31
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT-14B-Conversational
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 20.62
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT-14B-Conversational
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 47.54
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT-14B-Conversational
      name: Open LLM Leaderboard
---
# **QwQ-LCoT-14B-Conversational**

QwQ-LCoT-14B-Conversational is built on Qwen2.5-14B-Instruct and fine-tuned for complex, chain-of-thought-based long conversations. The fine-tuning targets tasks that require step-by-step reasoning, detailed explanations, and a nuanced understanding of intricate topics, making the model suited to use cases that demand precision, depth, and adaptability in dialogue.

It is particularly effective for long-form discussions, detailed problem-solving, and multi-step reasoning, and it maintains coherent, meaningful conversation over extended contexts.

| Rank | Type | Model | Average | IFEval | BBH | MATH | GPQA | MUSR | MMLU | CO₂ C | Dated |
|------|------|-----------------------------------|---------|--------|-------|-------|-------|-------|-------|--------|----------|
| 323 | 🔶 | [prithivMLmods/QwQ-LCoT-14B-Conversational](#) | 33.17 | 40.47 | 45.63 | 31.42 | 13.31 | 20.62 | 47.54 | 1.95 | 01/20/2025 |
## **Key Features**

### **Enhanced Knowledge and Capabilities**
- **Coding and Mathematics**: Significantly improved performance on coding and mathematical tasks, thanks to specialized expert models in these domains.

### **Advanced Instruction Following**
- **Instruction Following**: Enhanced ability to follow instructions accurately, even for complex tasks.
- **Long Text Generation**: Capable of generating long texts exceeding 8,000 tokens.
- **Structured Data Understanding**: Improved understanding of structured data such as tables.
- **JSON Generation**: Exceptional ability to generate structured outputs, including JSON.
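To illustrate the JSON-generation point, here is a minimal, hedged sketch of validating a model reply as strict JSON. The `extract_json` helper and the hard-coded `reply` string are illustrative assumptions; in practice `reply` would come from the generation flow shown in the Quickstart below. Fence-stripping is included because chat models often wrap JSON in Markdown code blocks.

```python
# Hedged sketch: parse a model reply as strict JSON, tolerating
# replies wrapped in ```json ... ``` Markdown fences.
import json
import re

def extract_json(reply: str) -> dict:
    """Return the JSON object contained in a model reply."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", reply, re.DOTALL)
    payload = match.group(1) if match else reply
    return json.loads(payload)

# Hard-coded stand-in for a model reply (illustrative only):
reply = '```json\n{"name": "QwQ-LCoT-14B-Conversational", "params_b": 14}\n```'
data = extract_json(reply)
print(data["params_b"])  # 14
```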
### **Resilient and Versatile**
- **Prompt Diversity**: Greater resilience to diverse system prompts, enhancing role-play scenarios and condition-setting for chatbots.

### **Long-Context Support**
- **Context Length**: Supports up to 128,000 tokens of context, with the ability to generate up to 8,000 tokens in a single response.
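A small sketch of how these two limits compose: the prompt consumes part of the 128,000-token window, which caps how much of the 8,000-token generation budget remains. The function is illustrative; real token counts would come from the tokenizer rather than a plain integer.

```python
# Hedged sketch: compute the remaining generation budget given the
# advertised limits (128,000-token context, 8,000-token generation).
CONTEXT_WINDOW = 128_000
MAX_GENERATION = 8_000

def generation_budget(prompt_tokens: int) -> int:
    """Tokens available for generation after reserving the prompt."""
    remaining = CONTEXT_WINDOW - prompt_tokens
    return max(0, min(MAX_GENERATION, remaining))

print(generation_budget(1_000))    # 8000
print(generation_budget(127_500))  # 500
```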
## **Quickstart**

The following code snippet shows how to load the tokenizer and model with `apply_chat_template` and generate content.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prithivMLmods/QwQ-LCoT-14B-Conversational"

# Load the model with automatic dtype selection and device placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the chat messages into the model's prompt format.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
### **Multilingual Support**

QwQ-LCoT-14B-Conversational offers robust multilingual support across more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. This makes it well suited to global applications such as multilingual customer support, cross-cultural communication, and localized content creation, whether the task is dialogue, problem-solving, or content generation; output quality can vary between languages (see Limitations).
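The 13 explicitly listed languages correspond to the ISO 639-3 codes added to this card's `language:` metadata in this commit. The mapping below is an illustrative Python mirror of that YAML front matter:

```python
# The 13 languages explicitly listed in the card, mapped to the
# ISO 639-3 codes used in the `language:` front matter above.
LISTED_LANGUAGES = {
    "Chinese": "zho", "English": "eng", "French": "fra",
    "Spanish": "spa", "Portuguese": "por", "German": "deu",
    "Italian": "ita", "Russian": "rus", "Japanese": "jpn",
    "Korean": "kor", "Vietnamese": "vie", "Thai": "tha",
    "Arabic": "ara",
}

def metadata_codes() -> list[str]:
    """Language codes in the order they appear in the YAML front matter."""
    return list(LISTED_LANGUAGES.values())

print(len(metadata_codes()))  # 13
```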
## **Applications**

QwQ-LCoT-14B-Conversational is ideal for:
- Long-form conversational AI
- Complex reasoning and chain-of-thought explanations
- Multilingual communication
- Structured data generation and processing
- Enhanced role-play and chatbot implementation
## **Intended Use**

1. **Long-Form Dialogue Systems**: QwQ-LCoT-14B-Conversational is designed for creating conversational agents capable of engaging in extended, context-rich dialogues, making it suitable for applications like customer support, virtual assistants, and interactive storytelling.

2. **Complex Reasoning Tasks**: The model excels at tasks requiring step-by-step reasoning, such as solving mathematical problems, coding challenges, and logical puzzles.

3. **Multilingual Communication**: With support for over 29 languages, the model is ideal for global applications, including multilingual customer service, translation, and cross-cultural communication.

4. **Structured Data Processing**: The model's ability to understand and generate structured data (e.g., tables, JSON) makes it useful for data analysis, report generation, and API integration.

5. **Content Generation**: It can generate high-quality, long-form content, including articles, essays, and technical documentation, across various domains and languages.

6. **Role-Play and Chatbots**: The model's resilience to diverse system prompts enhances its ability to simulate characters, role-play scenarios, and implement dynamic chatbot interactions.
## **Limitations**

1. **Performance Variability Across Languages**: While the model supports multiple languages, its performance may vary depending on the language, with better results for languages more prevalent in its training data.

2. **Handling of Niche Topics**: The model may struggle to provide accurate information or generate high-quality content for highly specialized or niche topics not covered extensively in its training data.

3. **Complex Multi-Step Reasoning**: Although optimized for reasoning tasks, the model may occasionally produce incorrect or incomplete results for highly complex or ambiguous problems.

4. **Bias and Ethical Concerns**: As with any large language model, QwQ-LCoT-14B-Conversational may inherit biases present in its training data, leading to potential ethical concerns or inappropriate outputs in certain contexts.

5. **Context Limitations**: Despite its large context window, the model may still face challenges in maintaining coherence and relevance for extremely long or dense inputs.

6. **Resource Intensive**: As a large-scale model with 14 billion parameters, it requires substantial computational resources for both inference and deployment, limiting its use in resource-constrained environments.

7. **Instruction Ambiguity**: The model's performance can degrade when instructions are ambiguous, vague, or conflicting, potentially leading to outputs that do not align with user expectations.
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/prithivMLmods__QwQ-LCoT-14B-Conversational-details), and summarized results [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=prithivMLmods%2FQwQ-LCoT-14B-Conversational&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc).

| Metric              | Value (%) |
|---------------------|----------:|
| **Average**         |     33.16 |
| IFEval (0-Shot)     |     40.47 |
| BBH (3-Shot)        |     45.63 |
| MATH Lvl 5 (4-Shot) |     31.42 |
| GPQA (0-shot)       |     13.31 |
| MuSR (0-shot)       |     20.62 |
| MMLU-PRO (5-shot)   |     47.54 |
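As a sanity check on the table above, the reported Average is the unweighted mean of the six benchmark scores. The exact mean is 33.165, which explains why the card shows both 33.16 (metric table) and 33.17 (leaderboard row): the two differ only in rounding.

```python
# Recompute the leaderboard Average as the unweighted mean of the six
# benchmark scores from the table above (values in %).
scores = {
    "IFEval (0-Shot)": 40.47,
    "BBH (3-Shot)": 45.63,
    "MATH Lvl 5 (4-Shot)": 31.42,
    "GPQA (0-shot)": 13.31,
    "MuSR (0-shot)": 20.62,
    "MMLU-PRO (5-shot)": 47.54,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.3f}")  # 33.165
```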