noneUsername committed
Commit e1ae9fc · verified · 1 Parent(s): c60068f

Create README.md

Files changed (1)
  1. README.md +188 -0
README.md ADDED
@@ -0,0 +1,188 @@
+ ---
+ base_model:
+ - mistralai/Devstral-Small-2505
+ ---
+ vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.864|± |0.0217|
+ | | |strict-match | 5|exact_match|↑ |0.860|± |0.0220|
+
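The run header above has the shape printed by the lm-evaluation-harness `lm_eval` CLI, although the README does not name the tool. Under that assumption, the following sketch rebuilds a matching command line from the header fields. The helper name `lm_eval_command` is hypothetical, and the command is only printed, not executed.

```python
import shlex


def lm_eval_command(pretrained, task, limit, num_fewshot=None,
                    batch_size="auto", max_model_len=3096):
    """Build an lm_eval argv list mirroring one run header (hypothetical helper)."""
    # Model arguments copied from the header's parenthesised key=value list.
    model_args = ",".join([
        f"pretrained={pretrained}",
        "add_bos_token=true",
        f"max_model_len={max_model_len}",
        "dtype=bfloat16",
        "trust_remote_code=true",
    ])
    argv = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", task,
        "--limit", str(limit),
        "--batch_size", str(batch_size),
    ]
    # The MMLU headers report num_fewshot: None, so the flag is optional here.
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    return argv


cmd = lm_eval_command("/root/autodl-tmp/Devstral-Small-2505", "gsm8k",
                      limit=250, num_fewshot=5)
print(shlex.join(cmd))
```

The local path under `/root/autodl-tmp/` is specific to the author's machine; anyone repeating the run would substitute their own checkpoint location.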
+ vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.868|± |0.0152|
+ | | |strict-match | 5|exact_match|↑ |0.864|± |0.0153|
+
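The Stderr values in the two GSM8K tables above are consistent with the standard error of a mean of 0/1 scores, computed with the sample (n-1) standard deviation. The sketch below checks this against the reported numbers. It is a consistency check under that assumption, not a statement of how the evaluation tool computes the column internally.

```python
import math


def sample_stderr(acc, n):
    """Standard error of the mean of n binary scores with mean `acc`,
    using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# (accuracy, limit) pairs copied from the base-model GSM8K tables.
reported = [
    (0.864, 250),  # flexible-extract, limit 250
    (0.860, 250),  # strict-match, limit 250
    (0.868, 500),  # flexible-extract, limit 500
    (0.864, 500),  # strict-match, limit 500
]

for acc, n in reported:
    print(f"acc={acc:.3f} n={n} stderr={sample_stderr(acc, n):.4f}")
```

Doubling the limit from 250 to 500 shrinks the standard error by roughly a factor of the square root of 2, which is why the 500-sample rows are the better basis for comparing checkpoints.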
+ vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
+ | Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |------------------|------:|------|------|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.7965|± |0.0129|
+ | - humanities | 2|none | |acc |↑ |0.8205|± |0.0244|
+ | - other | 2|none | |acc |↑ |0.8308|± |0.0259|
+ | - social sciences| 2|none | |acc |↑ |0.8444|± |0.0261|
+ | - stem | 2|none | |acc |↑ |0.7263|± |0.0252|
+
+
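With `limit: 15.0`, each MMLU subtask contributes at most 15 questions, so the overall `mmlu` row can be read as a question-weighted average of the four groups. The sketch below reproduces the overall accuracy from the group rows. The subtask counts per group (13, 13, 12, 19) are an assumption based on the usual 57-subject MMLU grouping and are not stated in the README.

```python
LIMIT_PER_SUBTASK = 15  # from "limit: 15.0" in the run header

# Assumed number of subtasks per group (57 subjects in total).
SUBTASKS = {"humanities": 13, "other": 13, "social sciences": 12, "stem": 19}

# Group accuracies copied from the base-model MMLU table.
group_acc = {
    "humanities": 0.8205,
    "other": 0.8308,
    "social sciences": 0.8444,
    "stem": 0.7263,
}


def micro_average(acc_by_group):
    """Weight each group's accuracy by its assumed number of questions."""
    total_q = sum(SUBTASKS[g] * LIMIT_PER_SUBTASK for g in acc_by_group)
    correct = sum(acc * SUBTASKS[g] * LIMIT_PER_SUBTASK
                  for g, acc in acc_by_group.items())
    return correct / total_q, total_q


overall, n_questions = micro_average(group_acc)
print(n_questions, round(overall, 4))
```

Under these assumptions the full run covers 855 questions, so each group score rests on only 180 to 285 questions and small differences between checkpoints should be read with the reported Stderr in mind.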
+ vllm (pretrained=/root/autodl-tmp/80-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.840|± |0.0232|
+ | | |strict-match | 5|exact_match|↑ |0.832|± |0.0237|
+
+
+ vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.840|± |0.0232|
+ | | |strict-match | 5|exact_match|↑ |0.828|± |0.0239|
+
+ vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.846|± |0.0162|
+ | | |strict-match | 5|exact_match|↑ |0.836|± |0.0166|
+
+ vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
+ | Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |------------------|------:|------|------|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.7532|± |0.0140|
+ | - humanities | 2|none | |acc |↑ |0.7744|± |0.0272|
+ | - other | 2|none | |acc |↑ |0.7692|± |0.0291|
+ | - social sciences| 2|none | |acc |↑ |0.8278|± |0.0277|
+ | - stem | 2|none | |acc |↑ |0.6807|± |0.0268|
+
+
+ vllm (pretrained=/root/autodl-tmp/root-W8A8-86-128-3096-2,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.812|± |0.0248|
+ | | |strict-match | 5|exact_match|↑ |0.800|± |0.0253|
+
+
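A quick way to judge whether a gap such as 0.864 (base model) versus 0.812 (`root-W8A8-86-128-3096-2`) at limit 250 is larger than sampling noise is a two-sample z-score built from the reported Stderr values. The sketch below treats the two runs as independent, which is an assumption: both runs presumably score the same 250 questions, so a paired test on per-question results would be more precise if those were available.

```python
import math


def z_score(acc_a, se_a, acc_b, se_b):
    """z-score for the difference of two accuracies, assuming independent runs."""
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# flexible-extract exact_match at limit 250, copied from the tables above.
z = z_score(0.864, 0.0217, 0.812, 0.0248)
print(round(z, 2))
```

A |z| below about 1.96 means the difference does not clear the conventional 5% threshold under this rough approximation, so larger limits would be needed to separate the two checkpoints with confidence.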
+ vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.848|± |0.0228|
+ | | |strict-match | 5|exact_match|↑ |0.836|± |0.0235|
+
+ vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.844|± |0.0162|
+ | | |strict-match | 5|exact_match|↑ |0.830|± |0.0168|
+
+ vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
+ | Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |------------------|------:|------|------|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.7614|± |0.0137|
+ | - humanities | 2|none | |acc |↑ |0.7590|± |0.0277|
+ | - other | 2|none | |acc |↑ |0.7949|± |0.0270|
+ | - social sciences| 2|none | |acc |↑ |0.8389|± |0.0273|
+ | - stem | 2|none | |acc |↑ |0.6912|± |0.0265|
+
+
+ vllm (pretrained=/root/autodl-tmp/86-512-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.820|± |0.0243|
+ | | |strict-match | 5|exact_match|↑ |0.808|± |0.0250|
+
+
+ vllm (pretrained=/root/autodl-tmp/865-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.840|± |0.0232|
+ | | |strict-match | 5|exact_match|↑ |0.828|± |0.0239|
+
+
+ vllm (pretrained=/root/autodl-tmp/87-64-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.824|± |0.0241|
+ | | |strict-match | 5|exact_match|↑ |0.808|± |0.0250|
+
+
+ vllm (pretrained=/root/autodl-tmp/87-64-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.844|± |0.0230|
+ | | |strict-match | 5|exact_match|↑ |0.836|± |0.0235|
+
+
+ vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.860|± |0.0220|
+ | | |strict-match | 5|exact_match|↑ |0.856|± |0.0222|
+
+ vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.85|± |0.0160|
+ | | |strict-match | 5|exact_match|↑ | 0.84|± |0.0164|
+
+ vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
+ | Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |------------------|------:|------|------|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.7509|± |0.0139|
+ | - humanities | 2|none | |acc |↑ |0.7949|± |0.0261|
+ | - other | 2|none | |acc |↑ |0.7641|± |0.0287|
+ | - social sciences| 2|none | |acc |↑ |0.8167|± |0.0285|
+ | - stem | 2|none | |acc |↑ |0.6702|± |0.0268|
+
+
+ vllm (pretrained=/root/autodl-tmp/87-128-3096-3,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.844|± |0.0230|
+ | | |strict-match | 5|exact_match|↑ |0.832|± |0.0237|
+
+
+ vllm (pretrained=/root/autodl-tmp/87-128-3096-4,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.804|± |0.0252|
+ | | |strict-match | 5|exact_match|↑ |0.804|± |0.0252|
+
+
+ vllm (pretrained=/root/autodl-tmp/87-128-4096-2,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.824|± |0.0241|
+ | | |strict-match | 5|exact_match|↑ |0.808|± |0.0250|
+
+
+ vllm (pretrained=/root/autodl-tmp/87-256-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.828|± |0.0239|
+ | | |strict-match | 5|exact_match|↑ |0.816|± |0.0246|
+
+
+ vllm (pretrained=/root/autodl-tmp/87-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.828|± |0.0239|
+ | | |strict-match | 5|exact_match|↑ |0.824|± |0.0241|
+
+
+ vllm (pretrained=/root/autodl-tmp/88-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.848|± |0.0228|
+ | | |strict-match | 5|exact_match|↑ |0.844|± |0.0230|
+
+
+ vllm (pretrained=/root/autodl-tmp/885-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.836|± |0.0235|
+ | | |strict-match | 5|exact_match|↑ |0.820|± |0.0243|
+
+
+ vllm (pretrained=/root/autodl-tmp/89-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.828|± |0.0239|
+ | | |strict-match | 5|exact_match|↑ |0.824|± |0.0241|
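
Because the README lists the runs without a summary, the sketch below gathers the GSM8K flexible-extract scores at limit 250 from the tables above and ranks each checkpoint by its gap to the base model. The dictionary keys are the directory names from the `pretrained=` paths; what the numeric parts of those names encode is not stated in the README, so no meaning is attached to them here.

```python
BASELINE = "Devstral-Small-2505"

# flexible-extract exact_match, 5-shot, limit 250, copied from the tables above.
scores = {
    "Devstral-Small-2505": 0.864,
    "80-128-4096": 0.840,
    "86-128-4096": 0.840,
    "root-W8A8-86-128-3096-2": 0.812,
    "86-256-4096": 0.848,
    "86-512-3096": 0.820,
    "865-128-3096": 0.840,
    "87-64-4096": 0.824,
    "87-64-3096": 0.844,
    "87-128-3096": 0.860,
    "87-128-3096-3": 0.844,
    "87-128-3096-4": 0.804,
    "87-128-4096-2": 0.824,
    "87-256-3096": 0.828,
    "87-256-4096": 0.828,
    "88-128-3096": 0.848,
    "885-128-3096": 0.836,
    "89-128-3096": 0.828,
}

# Gap to the base model, rounded to the three decimals reported in the tables.
deltas = {
    name: round(score - scores[BASELINE], 3)
    for name, score in scores.items()
    if name != BASELINE
}

# Smallest loss first.
ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)

for name, delta in ranked:
    print(f"{name:26s} {scores[name]:.3f} {delta:+.3f}")
```

Every gap in this list is smaller than about three times the reported Stderr of roughly 0.022 to 0.025, so the ordering among nearby checkpoints is only indicative at this sample size.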