File size: 13,729 Bytes
e1ae9fc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
---
base_model:
- mistralai/Devstral-Small-2505
---
vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.864|±  |0.0217|
|     |       |strict-match    |     5|exact_match|↑  |0.860|±  |0.0220|

vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.868|±  |0.0152|
|     |       |strict-match    |     5|exact_match|↑  |0.864|±  |0.0153|

vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7965|±  |0.0129|
| - humanities     |      2|none  |      |acc   |↑  |0.8205|±  |0.0244|
| - other          |      2|none  |      |acc   |↑  |0.8308|±  |0.0259|
| - social sciences|      2|none  |      |acc   |↑  |0.8444|±  |0.0261|
| - stem           |      2|none  |      |acc   |↑  |0.7263|±  |0.0252|


vllm (pretrained=/root/autodl-tmp/80-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.840|±  |0.0232|
|     |       |strict-match    |     5|exact_match|↑  |0.832|±  |0.0237|


vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.840|±  |0.0232|
|     |       |strict-match    |     5|exact_match|↑  |0.828|±  |0.0239|

vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.846|±  |0.0162|
|     |       |strict-match    |     5|exact_match|↑  |0.836|±  |0.0166|

vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7532|±  |0.0140|
| - humanities     |      2|none  |      |acc   |↑  |0.7744|±  |0.0272|
| - other          |      2|none  |      |acc   |↑  |0.7692|±  |0.0291|
| - social sciences|      2|none  |      |acc   |↑  |0.8278|±  |0.0277|
| - stem           |      2|none  |      |acc   |↑  |0.6807|±  |0.0268|


vllm (pretrained=/root/autodl-tmp/root-W8A8-86-128-3096-2,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.812|±  |0.0248|
|     |       |strict-match    |     5|exact_match|↑  |0.800|±  |0.0253|


vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.848|±  |0.0228|
|     |       |strict-match    |     5|exact_match|↑  |0.836|±  |0.0235|

vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.844|±  |0.0162|
|     |       |strict-match    |     5|exact_match|↑  |0.830|±  |0.0168|

vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7614|±  |0.0137|
| - humanities     |      2|none  |      |acc   |↑  |0.7590|±  |0.0277|
| - other          |      2|none  |      |acc   |↑  |0.7949|±  |0.0270|
| - social sciences|      2|none  |      |acc   |↑  |0.8389|±  |0.0273|
| - stem           |      2|none  |      |acc   |↑  |0.6912|±  |0.0265|


vllm (pretrained=/root/autodl-tmp/86-512-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.820|±  |0.0243|
|     |       |strict-match    |     5|exact_match|↑  |0.808|±  |0.0250|


vllm (pretrained=/root/autodl-tmp/865-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.840|±  |0.0232|
|     |       |strict-match    |     5|exact_match|↑  |0.828|±  |0.0239|


vllm (pretrained=/root/autodl-tmp/87-64-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.824|±  |0.0241|
|     |       |strict-match    |     5|exact_match|↑  |0.808|±  |0.0250|


vllm (pretrained=/root/autodl-tmp/87-64-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.844|±  |0.0230|
|     |       |strict-match    |     5|exact_match|↑  |0.836|±  |0.0235|


vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.860|±  |0.0220|
|     |       |strict-match    |     5|exact_match|↑  |0.856|±  |0.0222|

vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.85|±  |0.0160|
|     |       |strict-match    |     5|exact_match|↑  | 0.84|±  |0.0164|

vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.7509|±  |0.0139|
| - humanities     |      2|none  |      |acc   |↑  |0.7949|±  |0.0261|
| - other          |      2|none  |      |acc   |↑  |0.7641|±  |0.0287|
| - social sciences|      2|none  |      |acc   |↑  |0.8167|±  |0.0285|
| - stem           |      2|none  |      |acc   |↑  |0.6702|±  |0.0268|


vllm (pretrained=/root/autodl-tmp/87-128-3096-3,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.844|±  |0.0230|
|     |       |strict-match    |     5|exact_match|↑  |0.832|±  |0.0237|


vllm (pretrained=/root/autodl-tmp/87-128-3096-4,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.804|±  |0.0252|
|     |       |strict-match    |     5|exact_match|↑  |0.804|±  |0.0252|


vllm (pretrained=/root/autodl-tmp/87-128-4096-2,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.824|±  |0.0241|
|     |       |strict-match    |     5|exact_match|↑  |0.808|±  |0.0250|


vllm (pretrained=/root/autodl-tmp/87-256-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.828|±  |0.0239|
|     |       |strict-match    |     5|exact_match|↑  |0.816|±  |0.0246|


vllm (pretrained=/root/autodl-tmp/87-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.828|±  |0.0239|
|     |       |strict-match    |     5|exact_match|↑  |0.824|±  |0.0241|


vllm (pretrained=/root/autodl-tmp/88-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.848|±  |0.0228|
|     |       |strict-match    |     5|exact_match|↑  |0.844|±  |0.0230|


vllm (pretrained=/root/autodl-tmp/885-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.836|±  |0.0235|
|     |       |strict-match    |     5|exact_match|↑  |0.820|±  |0.0243|


vllm (pretrained=/root/autodl-tmp/89-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.828|±  |0.0239|
|     |       |strict-match    |     5|exact_match|↑  |0.824|±  |0.0241|