---
license: apache-2.0
datasets:
  - open-r1/OpenR1-Math-220k
  - yentinglin/s1K-1.1-trl-format
  - simplescaling/s1K-1.1
language:
  - en
metrics:
  - accuracy
base_model:
  - mistralai/Mistral-Small-24B-Instruct-2501
pipeline_tag: text-generation
tags:
  - reasoning
model-index:
  - name: yentinglin/Mistral-Small-24B-Instruct-2501-reasoning
    results:
      - task:
          type: text-generation
        dataset:
          name: MATH-500
          type: MATH
        metrics:
          - name: pass@1
            type: pass@1
            value: 0.95
            verified: false
        source:
          name: yentinglin/zhtw-reasoning-eval-leaderboard
          url: >-
            https://huggingface.co/spaces/yentinglin/zhtw-reasoning-eval-leaderboard
      - task:
          type: text-generation
        dataset:
          name: AIME 2025
          type: AIME
        metrics:
          - name: pass@1
            type: pass@1
            value: 0.5333
            verified: false
        source:
          name: yentinglin/zhtw-reasoning-eval-leaderboard
          url: >-
            https://huggingface.co/spaces/yentinglin/zhtw-reasoning-eval-leaderboard
      - task:
          type: text-generation
        dataset:
          name: AIME 2024
          type: AIME
        metrics:
          - name: pass@1
            type: pass@1
            value: 0.6667
            verified: false
        source:
          name: yentinglin/zhtw-reasoning-eval-leaderboard
          url: >-
            https://huggingface.co/spaces/yentinglin/zhtw-reasoning-eval-leaderboard
      - task:
          type: text-generation
        dataset:
          name: GPQA Diamond
          type: GPQA
        metrics:
          - name: pass@1
            type: pass@1
            value: 0.62022
            verified: false
        source:
          name: yentinglin/zhtw-reasoning-eval-leaderboard
          url: >-
            https://huggingface.co/spaces/yentinglin/zhtw-reasoning-eval-leaderboard
---

# Mistral-Small-Reasoning

This model is a fine-tuned version of [mistralai/Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501), optimized for mathematical reasoning. It was fine-tuned on the OpenR1-Math-220k and s1K-1.1 datasets to strengthen its reasoning capabilities.


## How to Get Started with the Model

A demo is available at twllm.com, and inference can be run with vLLM or SGLang, for example as in the sketch below.
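
A minimal offline-inference sketch with vLLM (my example, not an official snippet from this card; assumes `pip install vllm`, enough GPU memory for a 24B model, and illustrative sampling settings):

```python
# Minimal vLLM sketch (illustrative; sampling values are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(model="yentinglin/Mistral-Small-24B-Instruct-2501-reasoning")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
outputs = llm.chat(messages, params)  # applies the tokenizer's chat template
print(outputs[0].outputs[0].text)
```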

## Training Details

The model was trained using 4×8 H100 GPUs, provided by Ubitus.

Built with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl). The full training config is reproduced below.

axolotl version: `a98526ef7843a3e8aa006f260e6b4fb8912b5f1a`

```yaml
base_model: mistralai/Mistral-Small-24B-Instruct-2501

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: yentinglin/s1K-1.1-trl-format
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
  - path: open-r1/OpenR1-Math-220k
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: from
    message_field_content: value
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./placeholder/

sequence_len: 32768
sample_packing: true
eval_sample_packing: False
pad_to_sequence_len: true

wandb_project: Reasoning
wandb_entity:
wandb_watch:
wandb_name: Mistral-24B-SFT-220k
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 5
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
saves_per_epoch: 2
weight_decay: 0.0
deepspeed: deepspeed_configs/zero3_bf16.json
special_tokens:
  pad_token: "<pad>"
```
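
As a quick sanity check on the schedule (my arithmetic, not stated in the card): with `micro_batch_size: 1`, `gradient_accumulation_steps: 4`, and the 4×8 H100s mentioned above, the effective global batch size works out to 128 packed sequences, each up to 32,768 tokens.

```python
# Back-of-the-envelope effective batch size implied by the config above,
# assuming one data-parallel rank per GPU under DeepSpeed ZeRO-3.
micro_batch_size = 1             # sequences per GPU per step
gradient_accumulation_steps = 4  # steps accumulated per optimizer update
num_gpus = 4 * 8                 # 4 nodes x 8 H100s

effective_batch = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 128 packed sequences of up to 32,768 tokens each
```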

## Evaluation

The evaluation code is available at [Hugging Face Open-R1](https://github.com/huggingface/open-r1). Note that I have updated the AIME 25 dataset to the full set, available as AIME 2025.

Our results below are averaged over multiple runs. See our eval details here.

| Model (pass@1) | # Params | MATH-500 | AIME 2025 | AIME 2024 | GPQA Diamond |
|---|---|---|---|---|---|
| Mistral-24B-Reasoning (Ours) | 24B | 95.0 | 53.33 | 66.67 | 62.02 |
| Mistral-24B-Instruct | 24B | 70.6 | - | - | 45.3 |
| s1.1-32B | 32B | 93.2 | 40.0 | 56.7 | 61.62 |
| LIMO | 32B | 94.8 | 36.67 | 57.1 | 59.09 |
| DeepSeek-R1-Distill-Llama-70B | 70B | 94.5 | 46.67 | 70.0 | 65.2 |
| DeepSeek-R1-Distill-Qwen-32B | 32B | 94.3 | 60.0 | 72.6 | 62.1 |
| DeepSeek-R1 | 671B | 97.3 | 70.0 | 72.6 | 71.5 |
| o1 | - | 96.4 | 79.0 | - | 75.7 |
| o3-mini (high) | - | 97.9 | 86.5 | - | 77.2 |
| o3-mini (medium) | - | 97.3 | 76.5 | - | 74.9 |
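
For clarity on the headline metric, here is a toy sketch of "pass@1 averaged over multiple runs" (illustrative only; the actual implementation lives in the Open-R1 eval code linked above):

```python
# Illustrative pass@1: per run, the fraction of problems solved on a
# single attempt; the reported number is the mean across runs.
def pass_at_1(correct_per_run: list[list[bool]]) -> float:
    per_run = [sum(run) / len(run) for run in correct_per_run]
    return sum(per_run) / len(per_run)

# Example: two runs over four problems each.
runs = [
    [True, True, False, True],  # run 1: 3/4 solved
    [True, False, True, True],  # run 2: 3/4 solved
]
print(pass_at_1(runs))  # 0.75
```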

## Citation

If you use this model, please cite:

```bibtex
@article{yentinglin2025_mistral_reasoning,
  author  = {Yenting Lin},
  title   = {Mistral-Small-24B-Instruct-2501-reasoning},
  journal = {Hugging Face},
  year    = {2025},
  url     = {https://huggingface.co/yentinglin/Mistral-Small-24B-Instruct-2501-reasoning}
}
```