Long Reasoning
Datasets with reasoning traces for math and code (Train + Eval)
Viewer • Updated • 13.8k • 329 • 4
Note: High school math competition questions covering several branches of mathematics, such as algebra, geometry, probability, and number theory. Train: 7500. Test: 5000. https://github.com/hendrycks/math
HuggingFaceH4/MATH-500
Viewer • Updated • 500 • 63.8k • 140
Note: 500 questions selected from the MATH benchmark. Test: 500. https://github.com/openai/prm800k
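When scoring against MATH-style problems, the final answer usually has to be pulled out of a worked solution. The sketch below assumes the common convention that the final answer is wrapped in `\boxed{...}`; the function name `last_boxed` is hypothetical and not part of any dataset tooling.

```python
from typing import Optional


def last_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are just inside the opening brace of \boxed{
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


# Made-up solution text, for illustration only.
example = r"So the sum is $\boxed{\frac{3}{4}}$."
print(last_boxed(example))  # \frac{3}{4}
```

Brace counting is used instead of a regular expression so that nested LaTeX such as `\frac{3}{4}` is returned whole.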
microsoft/orca-math-word-problems-200k
Viewer • Updated • 200k • 2.26k • 450
Note: Grade school math word problems drawn from a variety of elementary problem sets. Train: 200K. https://arxiv.org/pdf/2402.14830
openai/gsm8k
Viewer • Updated • 17.6k • 391k • 692
Note: Grade school math word problems (easy). Train: 7473. Test: 1319. https://arxiv.org/abs/2110.14168
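GSM8K reference solutions end with a line of the form `#### <number>`, so evaluation code typically extracts the value after that marker. A minimal sketch, assuming that convention; the helper name `gsm8k_final_answer` and the sample text are made up for illustration.

```python
import re
from typing import Optional


def gsm8k_final_answer(answer: str) -> Optional[str]:
    """Return the number after the '####' marker, with thousands commas removed."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", answer)
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Hypothetical sample in the GSM8K answer style (not taken from the dataset).
sample = "She bakes 3 trays of 24 cookies, so 3 * 24 = 72 cookies.\n#### 72"
print(gsm8k_final_answer(sample))  # 72
```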
AI-MO/aimo-validation-aime
Viewer • Updated • 90 • 8.73k • 44
Note: Problems from AIME 2022, AIME 2023, and AIME 2024. Train: 90. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
HuggingFaceH4/aime_2024
Viewer • Updated • 30 • 29.1k • 26
Note: 30 problems from the 2024 AIME I and AIME II tests.
AI-MO/aimo-validation-amc
Viewer • Updated • 83 • 2.48k • 14
Note: AMC 12 2022 and AMC 12 2023 (a math competition for students in grade 12 and below that tests more basic high school math knowledge). Train: 83. https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions
opencompass/AIME2025
Viewer • Updated • 30 • 3.77k • 14
Note: AIME 2025. Train: 15
NovaSky-AI/Sky-T1_preference_data_10k
Viewer • Updated • 9.43k • 186 • 13
Note: Preference data aimed at decreasing generation length while preserving accuracy across domains such as mathematics, coding, science, and general knowledge. Sky-T1-32B-Preview + PRM800K (12K questions). Train: 10K. https://novasky-ai.github.io/posts/reduce-overthinking/
TIGER-Lab/MMLU-Pro
Viewer • Updated • 12.1k • 47.5k • 343
Note: Each question has ten multiple-choice options. Train: 12032
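With ten options per question, prompts need the letters A through J rather than the usual A through D. The following is only a sketch of one way to lay out such a prompt; the function name `format_question` and the sample question are hypothetical and do not reflect the dataset's own schema.

```python
import string
from typing import List


def format_question(question: str, options: List[str]) -> str:
    """Render a question followed by lettered options (A, B, C, ...)."""
    letters = string.ascii_uppercase[: len(options)]
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    return "\n".join(lines)


# Made-up question with ten options, for illustration only.
options = [str(n) for n in range(1, 11)]
print(format_question("What is 2 + 3?", options))
```

Deriving the letters from the number of options keeps the same helper usable for questions that carry fewer than ten choices.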
tasksource/PRM800K
Preview • Updated • 84 • 33
Note: Train: 12000. Test: 500. https://github.com/openai/prm800k/tree/main
Idavidrein/gpqa
Viewer • Updated • 1.25k • 69.9k • 157
Note: Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry. GPQA Diamond (Rein et al., 2023) consists of 198 PhD-level science questions from biology, chemistry, and physics. Test: 448. https://github.com/idavidrein/gpqa
livecodebench/code_generation_lite
Updated • 53k • 35
Note: A continuously updated code benchmark built from contests on LeetCode, AtCoder, and CodeForces.
Test:
1. release_v1: 400
2. release_v2: 511
3. release_v3: 612
4. release_v4: 713
5. release_v5: 880
https://github.com/LiveCodeBench/LiveCodeBench
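The release sizes listed for LiveCodeBench grow monotonically. Assuming each release is cumulative (that is, it contains every problem from the earlier ones, which the note itself does not state), the number of problems added per release can be worked out as below. The helper name `added_per_release` is hypothetical.

```python
from typing import Dict

# Test-set sizes as listed in the note above.
RELEASE_TOTALS = {
    "release_v1": 400,
    "release_v2": 511,
    "release_v3": 612,
    "release_v4": 713,
    "release_v5": 880,
}


def added_per_release(totals: Dict[str, int]) -> Dict[str, int]:
    """Difference between each release total and the previous one.

    Assumes releases are cumulative and given in chronological order.
    """
    added = {}
    previous = 0
    for name, total in totals.items():
        added[name] = total - previous
        previous = total
    return added


print(added_per_release(RELEASE_TOTALS))
# {'release_v1': 400, 'release_v2': 111, 'release_v3': 101, 'release_v4': 101, 'release_v5': 167}
```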
AI-MO/NuminaMath-1.5
Viewer • Updated • 896k • 2.47k • 133
Note: Competition-level math problems with chain-of-thought (CoT) solutions. Train: 900K
KbsdJames/Omni-MATH
Viewer • Updated • 4.43k • 2.26k • 95
Note: 4428 competition-level problems. Test: 4428. https://github.com/KbsdJames/Omni-MATH
GAIR/OlympicArena
Viewer • Updated • 10.6k • 468 • 19
Note: 11,163 bilingual problems in both text-only and interleaved text-image modalities, drawn from 62 distinct Olympic competitions, with 13 answer types. Test: 11163. https://github.com/GAIR-NLP/OlympicArena
codeparrot/apps
Viewer • Updated • 20k • 6.04k • 171
Note: Code generation benchmark. Train: 10000
heya5/math_oai
Viewer • Updated • 500 • 31
Note: Math eval benchmark. Test: 500
svc-huggingface/minerva-math
Viewer • Updated • 272 • 204
Note: Math eval benchmark. Test: 272
Hothan/OlympiadBench
Viewer • Updated • 8.48k • 2.51k • 25
Note: Olympiad-level bilingual multimodal scientific benchmark (math + physics). Test: 8,476 problems from Olympiad-level mathematics and physics competitions.
BAAI/TACO
Updated • 3.16k • 108
Note: TACO is a code generation benchmark with 26,443 problems. It can be used to evaluate how well language models generate code from natural language specifications. Train: 26443
GAIR/o1-journey
Viewer • Updated • 327 • 224 • 134
GAIR/LIMO
Viewer • Updated • 817 • 4.51k • 150
Note: Curated mathematical reasoning data from NuminaMath-CoT, AIME, and MATH. Train: 817. https://github.com/GAIR-NLP/LIMO
simplescaling/s1K-1.1
Viewer • Updated • 1k • 5.52k • 107
Note: The same 1,000 questions as s1K, but with traces generated by DeepSeek-R1 instead. Train: 1000. https://github.com/simplescaling/s1
simplescaling/data_ablation_full59K
Viewer • Updated • 60.4k • 627 • 19
Note: The full 59K questions of s1, drawn from NuminaMATH, MATH, OlympicArena, OmniMath, AGIEval, xword, OlympiadBench, AIME (1983-2023), TheoremQA, USACO, JEEBench, GPQA, SciEval, s1-prob (128 Stanford statistics qualifying exams), LiveCodeBench, and s1-teasers (23 interview questions for quantitative trading positions; each sample consists of a problem and solution taken from PuzzledQuant (https://www.puzzledquant.com/), and only examples with the highest difficulty level, "Hard", are kept). Train: 59029
RUC-AIBOX/long_form_thought_data_5k
Viewer • Updated • 4.92k • 113 • 26
Note: STILL-2. Train: 5K. https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
RUC-AIBOX/STILL-3-Preview-RL-Data
Viewer • Updated • 29.9k • 489 • 12
Note: STILL-3: MATH, NuminaMathCoT, and AIME 1983-2023. Train: 30K. https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
bespokelabs/Bespoke-Stratos-17k
Viewer • Updated • 16.7k • 23.8k • 304
Note: Bespoke-Stratos-17k was created with an improved version of the Berkeley Sky-T1 data pipeline, using SFT distillation data from DeepSeek-R1.
NovaSky-AI/Sky-T1_data_17k
Viewer • Updated • 16.4k • 903 • 180
Note: 5k coding data from APPS and TACO, and 10k math data from the AIME, MATH, and Olympiads subsets of the NuminaMATH dataset, plus 1k science and puzzle data from STILL-2. Train: 17K. https://novasky-ai.github.io/posts/sky-t1/
open-thoughts/OpenThoughts-114k
Viewer • Updated • 228k • 30.3k • 688
Note: 114k high-quality examples covering math, science, code, and puzzles, distilled from DeepSeek-R1.
Code: 1. BAAI/TACO 2. codeparrot/apps 3. deepmind/code_contests 4. MatrixStudio/Codeforces-Python-Submissions
Math: 1. AI-MO/NuminaMath-CoT
Science: 1. camel-ai/chemistry 2. camel-ai/biology 3. camel-ai/physics
Puzzle: 1. INK-USC/riddle_sense
Train: 113957. https://github.com/open-thoughts/open-thoughts
open-r1/OpenR1-Math-220k
Viewer • Updated • 450k • 40.4k • 554
Note: Distilled from DeepSeek R1 on 400k problems from NuminaMath 1.5. Train: 220K.
default: 94k problems; this subset achieves the best performance after SFT.
extended: 131k samples that add data sources such as cn_k12; SFT performance is lower.
FreedomIntelligence/medical-o1-reasoning-SFT
Viewer • Updated • 50.1k • 19.2k • 642
Note: Advanced medical CoT reasoning distilled from GPT-4o. Train: 25.4K. https://github.com/FreedomIntelligence/HuatuoGPT-o1
open-r1/OpenThoughts-114k-math
Viewer • Updated • 89.1k • 1.09k • 79
Note: The math subset of OpenThoughts-114k with extra metadata. Train: 89120. Of those, 56730/89120 (about 63.7%) have correct answers, as checked by Math-Verify.
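The verified share quoted in this note is simply the ratio of the two counts. As a quick arithmetic check (the helper name `verified_share` is hypothetical):

```python
def verified_share(correct: int, total: int) -> float:
    """Fraction of examples whose answers were verified as correct."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Counts taken from the note above.
share = verified_share(56730, 89120)
print(f"{share:.1%}")  # 63.7%
```

The remaining 32,390 examples (89120 - 56730) are those whose answers were not confirmed by the checker.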
EricLu/SCP-116K
Viewer • Updated • 182k • 551 • 89
Note: High-quality undergraduate to doctoral-level content (physics, chemistry, and biology) filtered from 6.69 million web-crawled academic documents, with solutions distilled from o1-mini and QwQ-32B-preview, along with validation flags. Train: 116,756. https://github.com/AQA6666/SCP-116K-open/tree/main
agentica-org/DeepScaleR-Preview-Dataset
Viewer • Updated • 40.3k • 3.49k • 105
Note: Unique mathematics problem-answer pairs from: AIME (American Invitational Mathematics Examination) problems (1984-2023); AMC (American Mathematics Competition) problems (prior to 2023); the Omni-MATH dataset; and the Still dataset. Train: 40,000
math-eval/TAL-SCQ5K
Viewer • Updated • 10k • 161 • 57
Note: English and Chinese multiple-choice mathematical competition questions, from primary and junior high to high school level. Train: 3K. Test: 2K
TIGER-Lab/WebInstructSub
Viewer • Updated • 2.34M • 546 • 147
Note: 10M math and science instruction data from the web (this release is partial data, coming mostly from forums such as StackExchange). Train: 2.34M
TIGER-Lab/TheoremQA
Viewer • Updated • 800 • 361 • 17
Note: STEM theorem-based reasoning benchmark covering 350+ theorems across Math, EE&CS, Physics, and Finance. Test: 800
yentinglin/aime_2025
Viewer • Updated • 60 • 5.34k
Note: This dataset contains 30 problems from the 2025 AIME tests: AIME I (15 problems) and AIME II (15 problems).
GAIR/LIMR
Viewer • Updated • 1.39k • 188 • 25
Note: 1,389 selected questions from MATH (levels 3-5). Train: 1389
lmms-lab/multimodal-open-r1-8k-verified
Viewer • Updated • 7.69k • 1.78k • 51
Note: Multimodal reasoning data generated by GPT-4o with reasoning paths and verifiable answers, based on Math360K and Geo170K. Train: 8K
SynthLabsAI/Big-Math-RL-Verified
Viewer • Updated • 251k • 5.38k • 170
Note: A heavily filtered collection of high-quality mathematical problems from open-source datasets. Problems have uniquely verifiable solutions, open-ended problem formulations, and closed-form solutions. An extra 47,000 problems, Big-Math-Reformulated, are multiple-choice questions reformulated as open-ended ones. Train: 251K
open-r1/codeforces-cots
Viewer • Updated • 254k • 12.3k • 140
Note: 10k CodeForces problems. Train: 10K
open-r1/ioi
Viewer • Updated • 270 • 1.63k • 6
Note: International Olympiad in Informatics (IOI) 2020-2024. Train: 229. Test: 41
KodCode/KodCode-V1
Viewer • Updated • 487k • 1.85k • 81
Note: Fully synthetic open-source dataset providing verifiable solutions and tests for coding tasks. Train: 444K
GeneralReasoning/GeneralThought-323K
Viewer • Updated • 323k • 648 • 29
Note: Covers natural sciences, humanities, social sciences, and general conversations. Train: 323K
Zhiqiang007/MathV360K
Viewer • Updated • 339k • 450 • 23
Note: 360k multimodal problems from diverse domains, including arithmetic, geometry, calculus, science, and more. Train: 360K
glaiveai/reasoning-v1-20m
Viewer • Updated • 22.2M • 13.3k • 192
Note: Synthetic general reasoning dataset with 22 million+ examples (the reasoning traces and answers have not been verified for accuracy). Train: 22.2M
facebook/natural_reasoning
Viewer • Updated • 1.15M • 8.15k • 491
Note: General reasoning dataset with reference answers and distilled reasoning responses, sourced from the pretraining corpora DCLM and FineMath. Train: 1.15M