vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.864|±|0.0217|
| | |strict-match|5|exact_match|↑|0.860|±|0.0220|
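
As a sanity check on the Stderr column, the value for a 0/1 metric such as exact_match can be reproduced from the score and the sample count. The following is a minimal sketch, assuming the harness reports the sample standard error of the mean (n - 1 in the denominator) and that `limit` equals the number of scored examples. The helper name `binary_stderr` is hypothetical and not part of lm-eval.

```python
import math

def binary_stderr(p: float, n: int) -> float:
    """Sample standard error of the mean for n 0/1 outcomes whose mean is p.

    Assumption: sample variance with an (n - 1) denominator, then divided by n.
    """
    return math.sqrt(p * (1.0 - p) / (n - 1))

# gsm8k flexible-extract from the run above: value 0.864 at limit 250
flexible = round(binary_stderr(0.864, 250), 4)
# gsm8k strict-match from the same run: value 0.860 at limit 250
strict = round(binary_stderr(0.860, 250), 4)
print(flexible, strict)  # 0.0217 0.022
```

Under these assumptions the computed values agree with the reported 0.0217 and 0.0220, which suggests the Stderr column reflects sampling noise from the `limit` alone.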

vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.868|±|0.0152|
| | |strict-match|5|exact_match|↑|0.864|±|0.0153|

vllm (pretrained=/root/autodl-tmp/Devstral-Small-2505,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|------|------:|------|------|------|---|----:|---|-----:|
|mmlu|2|none| |acc|↑|0.7965|±|0.0129|
| - humanities|2|none| |acc|↑|0.8205|±|0.0244|
| - other|2|none| |acc|↑|0.8308|±|0.0259|
| - social sciences|2|none| |acc|↑|0.8444|±|0.0261|
| - stem|2|none| |acc|↑|0.7263|±|0.0252|

vllm (pretrained=/root/autodl-tmp/80-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.840|±|0.0232|
| | |strict-match|5|exact_match|↑|0.832|±|0.0237|
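
To judge whether a gap such as 0.840 here versus 0.864 for Devstral-Small-2505 (both gsm8k flexible-extract at limit 250) exceeds sampling noise, the two reported standard errors can be combined into a rough z-score. This is only a sketch: it treats the two runs as independent samples, which is an assumption, since both runs most likely score the same questions and a paired test would be more sensitive. The function name `z_score` is hypothetical.

```python
import math

def z_score(a: float, se_a: float, b: float, se_b: float) -> float:
    """Difference of two scores divided by the combined standard error.

    Assumption: the two estimates are independent, so variances add.
    """
    return (a - b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Devstral-Small-2505 (0.864 ± 0.0217) vs. 80-128-4096 (0.840 ± 0.0232)
z = z_score(0.864, 0.0217, 0.840, 0.0232)
print(round(z, 2))  # 0.76
```

A magnitude below roughly 1.96 means the difference is within what sampling noise alone could produce at the 5% level, so gaps of this size at limit 250 should be read with caution.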

vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.840|±|0.0232|
| | |strict-match|5|exact_match|↑|0.828|±|0.0239|

vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.846|±|0.0162|
| | |strict-match|5|exact_match|↑|0.836|±|0.0166|

vllm (pretrained=/root/autodl-tmp/86-128-4096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|------|------:|------|------|------|---|----:|---|-----:|
|mmlu|2|none| |acc|↑|0.7532|±|0.0140|
| - humanities|2|none| |acc|↑|0.7744|±|0.0272|
| - other|2|none| |acc|↑|0.7692|±|0.0291|
| - social sciences|2|none| |acc|↑|0.8278|±|0.0277|
| - stem|2|none| |acc|↑|0.6807|±|0.0268|

vllm (pretrained=/root/autodl-tmp/root-W8A8-86-128-3096-2,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.812|±|0.0248|
| | |strict-match|5|exact_match|↑|0.800|±|0.0253|

vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.848|±|0.0228|
| | |strict-match|5|exact_match|↑|0.836|±|0.0235|

vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.844|±|0.0162|
| | |strict-match|5|exact_match|↑|0.830|±|0.0168|

vllm (pretrained=/root/autodl-tmp/86-256-4096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|------|------:|------|------|------|---|----:|---|-----:|
|mmlu|2|none| |acc|↑|0.7614|±|0.0137|
| - humanities|2|none| |acc|↑|0.7590|±|0.0277|
| - other|2|none| |acc|↑|0.7949|±|0.0270|
| - social sciences|2|none| |acc|↑|0.8389|±|0.0273|
| - stem|2|none| |acc|↑|0.6912|±|0.0265|

vllm (pretrained=/root/autodl-tmp/86-512-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.820|±|0.0243|
| | |strict-match|5|exact_match|↑|0.808|±|0.0250|

vllm (pretrained=/root/autodl-tmp/865-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.840|±|0.0232|
| | |strict-match|5|exact_match|↑|0.828|±|0.0239|

vllm (pretrained=/root/autodl-tmp/87-64-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.824|±|0.0241|
| | |strict-match|5|exact_match|↑|0.808|±|0.0250|

vllm (pretrained=/root/autodl-tmp/87-64-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.844|±|0.0230|
| | |strict-match|5|exact_match|↑|0.836|±|0.0235|

vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.860|±|0.0220|
| | |strict-match|5|exact_match|↑|0.856|±|0.0222|

vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.85|±|0.0160|
| | |strict-match|5|exact_match|↑|0.84|±|0.0164|

vllm (pretrained=/root/autodl-tmp/87-128-3096,add_bos_token=true,max_model_len=3048,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|------|------:|------|------|------|---|----:|---|-----:|
|mmlu|2|none| |acc|↑|0.7509|±|0.0139|
| - humanities|2|none| |acc|↑|0.7949|±|0.0261|
| - other|2|none| |acc|↑|0.7641|±|0.0287|
| - social sciences|2|none| |acc|↑|0.8167|±|0.0285|
| - stem|2|none| |acc|↑|0.6702|±|0.0268|

vllm (pretrained=/root/autodl-tmp/87-128-3096-3,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.844|±|0.0230|
| | |strict-match|5|exact_match|↑|0.832|±|0.0237|

vllm (pretrained=/root/autodl-tmp/87-128-3096-4,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.804|±|0.0252|
| | |strict-match|5|exact_match|↑|0.804|±|0.0252|

vllm (pretrained=/root/autodl-tmp/87-128-4096-2,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.824|±|0.0241|
| | |strict-match|5|exact_match|↑|0.808|±|0.0250|

vllm (pretrained=/root/autodl-tmp/87-256-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.828|±|0.0239|
| | |strict-match|5|exact_match|↑|0.816|±|0.0246|

vllm (pretrained=/root/autodl-tmp/87-256-4096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.828|±|0.0239|
| | |strict-match|5|exact_match|↑|0.824|±|0.0241|

vllm (pretrained=/root/autodl-tmp/88-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.848|±|0.0228|
| | |strict-match|5|exact_match|↑|0.844|±|0.0230|

vllm (pretrained=/root/autodl-tmp/885-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.836|±|0.0235|
| | |strict-match|5|exact_match|↑|0.820|±|0.0243|

vllm (pretrained=/root/autodl-tmp/89-128-3096,add_bos_token=true,max_model_len=3096,dtype=bfloat16,trust_remote_code=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.828|±|0.0239|
| | |strict-match|5|exact_match|↑|0.824|±|0.0241|