Long Reasoning - an OrangeEye Collection

HuggingFaceH4/MATH

Viewer • Updated Jan 28 • 13.8k • 386 • 5

Note High school math competition problems covering several branches of mathematics, such as algebra, geometry, probability, and number theory. Train: 7500 Test: 5000 https://github.com/hendrycks/math

HuggingFaceH4/MATH-500

Viewer • Updated Nov 15, 2024 • 500 • 60.2k • 160

Note 500 questions selected from the MATH benchmark Test: 500 https://github.com/openai/prm800k
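As a minimal sketch of how an entry in this collection could be pulled down and checked against its note, assuming the Hugging Face `datasets` library is installed and that the repository exposes a `test` split as the note states. The helper names `load_math500` and `check_size` are hypothetical, and the download only happens when `load_math500` is called.

```python
# Sketch only: assumes the Hugging Face `datasets` library is installed and
# that HuggingFaceH4/MATH-500 exposes a "test" split, as the note states.
EXPECTED_TEST_ROWS = 500  # row count taken from the note above


def load_math500():
    """Download the MATH-500 test split from the Hub (needs network access)."""
    # Imported lazily so this file can be loaded without the library or network.
    from datasets import load_dataset

    return load_dataset("HuggingFaceH4/MATH-500", split="test")


def check_size(dataset, expected=EXPECTED_TEST_ROWS):
    """Return True if the split has the row count listed in the note."""
    return len(dataset) == expected
```

Calling `check_size(load_math500())` is one way to confirm that a downloaded copy still matches the size listed here.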

microsoft/orca-math-word-problems-200k

Viewer • Updated Mar 4, 2024 • 200k • 1.98k • 454

Note A variety of elementary math word problem sets (grade school) Train: 200K https://arxiv.org/pdf/2402.14830

openai/gsm8k

Viewer • Updated Jan 4, 2024 • 17.6k • 513k • 788

Note Grade school math word problems (easy) Train: 7473 Test: 1319 https://arxiv.org/abs/2110.14168

AI-MO/aimo-validation-aime

Viewer • Updated May 7 • 90 • 5.68k • 47

Note AIME 22, AIME 23, and AIME 24 Train: 90 https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

HuggingFaceH4/aime_2024

Viewer • Updated Jan 26 • 30 • 27.9k • 35

Note 30 problems from the 2024 AIME I and AIME II tests

AI-MO/aimo-validation-amc

Viewer • Updated May 7 • 83 • 1.91k • 15

Note AMC12 2022, AMC12 2023 (Elementary math competition for students in grade 12 and below, examining the more basic high school math knowledge) Train: 83 https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions

opencompass/AIME2025

Viewer • Updated Feb 25 • 30 • 3.93k • 21

Note AIME 25 Train: 15

NovaSky-AI/Sky-T1_preference_data_10k

Viewer • Updated Jan 23 • 9.43k • 85 • 14

Note Decrease generation length, while preserving accuracy across domains such as mathematics, coding, science, and general knowledge. Sky-T1-32B-Preview + PRM800K (12K questions) Train: 10K https://novasky-ai.github.io/posts/reduce-overthinking/

TIGER-Lab/MMLU-Pro

Viewer • Updated Apr 6 • 12.1k • 43.3k • 357

Note Each question has ten multiple-choice options. Train: 12032

tasksource/PRM800K

Preview • Updated May 31, 2023 • 122 • 34

Note Train: 12000 Test: 500 https://github.com/openai/prm800k/tree/main

Idavidrein/gpqa

Viewer • Updated Mar 28, 2024 • 1.25k • 44.8k • 183

Note Graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry. GPQA Diamond (Rein et al., 2023) consists of 198 PhD-level science questions from biology, chemistry, and physics. Test: 448 https://github.com/idavidrein/gpqa

livecodebench/code_generation_lite

Updated 29 days ago • 47.5k • 51

Note A continuously updated code benchmark built from contests on LeetCode, AtCoder, and CodeForces. Test: release_v1: 400; release_v2: 511; release_v3: 612; release_v4: 713; release_v5: 880 https://github.com/LiveCodeBench/LiveCodeBench
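The release counts in the note above can be read as running totals. A small sketch, assuming each release is cumulative and contains all problems from the previous one (an assumption, not something the note states), shows how many problems each release would add. The function name is hypothetical.

```python
# Problem counts per release, copied from the note above.
release_sizes = {
    "release_v1": 400,
    "release_v2": 511,
    "release_v3": 612,
    "release_v4": 713,
    "release_v5": 880,
}


def new_problems_per_release(sizes):
    """Return how many problems each release adds over the previous one.

    Assumes the releases are cumulative and listed in chronological order.
    """
    added = {}
    previous = 0
    for name, total in sizes.items():
        added[name] = total - previous
        previous = total
    return added


increments = new_problems_per_release(release_sizes)
for name, count in increments.items():
    print(f"{name}: +{count}")
```

Under that assumption, evaluating on a later release minus an earlier one isolates the newer problems, which is useful when checking a model for contamination.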

AI-MO/NuminaMath-1.5

Viewer • Updated Feb 10 • 896k • 1.71k • 151

Note Competition-level math problems with chain-of-thought (CoT) solutions Train: 900K

KbsdJames/Omni-MATH

Viewer • Updated Oct 12, 2024 • 4.43k • 3.58k • 110

Note 4428 competition-level problems Test: 4428 https://github.com/KbsdJames/Omni-MATH

GAIR/OlympicArena

Viewer • Updated Jul 20, 2024 • 10.6k • 467 • 19

Note 11,163 bilingual problems across both text-only and interleaved text-image modalities from 62 distinct Olympic competitions with 13 answer types Test: 11163 https://github.com/GAIR-NLP/OlympicArena

codeparrot/apps

Updated Oct 20, 2022 • 4.07k • 175

Note Code generation benchmark Train: 10000

heya5/math_oai

Viewer • Updated Aug 7, 2024 • 500 • 38

Note Math eval benchmark Test: 500

svc-huggingface/minerva-math

Viewer • Updated Jan 22 • 272 • 51

Note Math eval benchmark Test: 272

Hothan/OlympiadBench

Viewer • Updated 26 days ago • 8.48k • 3.23k • 25

Note Olympiad-level bilingual multimodal scientific benchmark (math + physics) Test: 8,476 problems from Olympiad-level mathematics and physics competitions

BAAI/TACO

Updated Jun 19, 2024 • 1.97k • 112

Note TACO is a benchmark for code generation with 26,443 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. Train: 26443

GAIR/o1-journey

Viewer • Updated Oct 16, 2024 • 327 • 117 • 133

Note https://github.com/GAIR-NLP/O1-Journey

GAIR/LIMO

Viewer • Updated Feb 10 • 817 • 1.61k • 166

Note Curated mathematical reasoning data from NuminaMath-CoT, AIME, MATH Train: 817 https://github.com/GAIR-NLP/LIMO

simplescaling/s1K-1.1

Viewer • Updated Feb 27 • 1k • 2.23k • 124

Note The same 1,000 questions as s1K, but with reasoning traces generated by DeepSeek r1 instead. Train: 1000 https://github.com/simplescaling/s1

simplescaling/data_ablation_full59K

Viewer • Updated Feb 3 • 60.4k • 843 • 22

Note The full 59K questions of s1: NuminaMATH, MATH, OlympicArena, OmniMath, AGIEval, xword, OlympiadBench, AIME (1983-2023), TheoremQA, USACO, JEEBench, GPQA, SciEval, s1-prob (128 Stanford statistics qualifying exams), LiveCodeBench, and s1-teasers (23 interview questions for quantitative trading positions; each sample consists of a problem and solution taken from PuzzledQuant, https://www.puzzledquant.com/, and only examples with the highest difficulty level, "Hard", are included). Train: 59029

RUC-AIBOX/long_form_thought_data_5k

Viewer • Updated Dec 30, 2024 • 4.92k • 84 • 27

Note STILL-2 Train: 5K https://github.com/RUCAIBox/Slow_Thinking_with_LLMs

RUC-AIBOX/STILL-3-Preview-RL-Data

Viewer • Updated Jan 26 • 29.9k • 159 • 12

Note STILL-3: MATH, NuminaMathCoT, and AIME 1983-2023 Train: 30K https://github.com/RUCAIBox/Slow_Thinking_with_LLMs

bespokelabs/Bespoke-Stratos-17k

Viewer • Updated Jan 31 • 16.7k • 19.7k • 317

Note Improved the Berkeley Sky-T1 data pipeline using SFT distillation data from DeepSeek-R1 to create Bespoke-Stratos-17k

NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14 • 16.4k • 218 • 182

Note 5k coding examples from APPS and TACO, and 10k math examples from the AIME, MATH, and Olympiads subsets of the NuminaMATH dataset. In addition, 1k science and puzzle examples are kept from STILL-2. Train: 17K https://novasky-ai.github.io/posts/sky-t1/

open-thoughts/OpenThoughts-114k

Viewer • Updated 29 days ago • 228k • 40.5k • 717

Note 114k high-quality examples covering math, science, code, and puzzles, distilled from DeepSeek-R1. Code: BAAI/TACO, codeparrot/apps, deepmind/code_contests, MatrixStudio/Codeforces-Python-Submissions. Math: AI-MO/NuminaMath-CoT. Science: camel-ai/chemistry, camel-ai/biology, camel-ai/physics. Puzzle: INK-USC/riddle_sense. Train: 113957 https://github.com/open-thoughts/open-thoughts

open-r1/OpenR1-Math-220k

Viewer • Updated Feb 18 • 450k • 26.9k • 601

Note Reasoning traces for 400k problems from NuminaMath 1.5, distilled from DeepSeek R1. Train: 220K. default: 94k problems, the subset that achieves the best performance after SFT. extended: 131k samples that add data sources like cn_k12, with lower SFT performance.

FreedomIntelligence/medical-o1-reasoning-SFT

Viewer • Updated Apr 22 • 90.1k • 9.11k • 764

Note Advanced medical CoT reasoning data distilled from GPT-4o Train: 25.4K https://github.com/FreedomIntelligence/HuatuoGPT-o1

open-r1/OpenThoughts-114k-math

Viewer • Updated Jan 30 • 89.1k • 362 • 83

Note The math subset of OpenThoughts-114k with extra metadata Train: 89120 Of those, 56730/89120 (63%) have correct answers, as checked by Math-Verify
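The verified share quoted in the note above is a simple ratio. The sketch below recomputes it from the two counts in the note; the variable and function names are illustrative only.

```python
total = 89120     # rows in the train split, from the note above
verified = 56730  # rows whose answers Math-Verify marked as correct


def verified_share(correct, rows):
    """Fraction of rows whose answer was checked as correct."""
    return correct / rows


share = verified_share(verified, total)
unverified = total - verified  # rows without a verified-correct answer
print(f"{share:.1%} verified, {unverified} rows not verified")
```

Filtering to the verified rows before SFT would leave a little under two thirds of the data, so the trade-off between size and answer quality is worth checking for a given training budget.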

EricLu/SCP-116K

Viewer • Updated Mar 17 • 182k • 711 • 106

Note High-quality undergraduate to doctoral-level content (physics, chemistry, and biology) filtered from 6.69 million web-crawled academic documents, with solutions distilled from o1-mini and QwQ-32B-preview, along with validation flags. Train: 116,756 https://github.com/AQA6666/SCP-116K-open/tree/main

agentica-org/DeepScaleR-Preview-Dataset

Viewer • Updated Feb 10 • 40.3k • 5.4k • 135

Note Unique mathematics problem-answer pairs from: AIME (American Invitational Mathematics Examination) problems (1984-2023); AMC (American Mathematics Competition) problems (prior to 2023); the Omni-MATH dataset; the Still dataset. Train: 40,000

math-eval/TAL-SCQ5K

Viewer • Updated Sep 15, 2023 • 10k • 237 • 57

Note English and Chinese multiple-choice mathematical competition questions, from primary and junior high to high school level. Train: 3K Test: 2K

TIGER-Lab/WebInstructSub

Viewer • Updated Oct 27, 2024 • 2.34M • 2.92k • 150

Note 10M math- and science-related instruction data from the web (this release is a partial subset, coming mostly from forums like StackExchange) Train: 2.34M

TIGER-Lab/TheoremQA

Viewer • Updated May 15, 2024 • 800 • 1.5k • 17

Note STEM theorem-based reasoning benchmark, covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. Test: 800

yentinglin/aime_2025

Viewer • Updated Feb 15 • 60 • 10.5k

Note This dataset contains 30 problems from the 2025 AIME tests, including: AIME I: 15 problems AIME II: 15 problems

GAIR/LIMR

Viewer • Updated Feb 17 • 1.39k • 166 • 28

Note 1,389 selected questions from MATH (level 3-5) Train: 1389

lmms-lab/multimodal-open-r1-8k-verified

Viewer • Updated Jan 27 • 7.69k • 1.11k • 55

Note Multimodal reasoning data generated by GPT-4o with reasoning paths and verifiable answers, based on Math360K and Geo170K Train: 8K

SynthLabsAI/Big-Math-RL-Verified

Viewer • Updated Mar 25 • 251k • 5.57k • 186

Note A collection of open-source datasets of high-quality mathematical problems (heavily filtered). Each problem has a uniquely verifiable solution, an open-ended formulation, and a closed-form solution. An extra 47,000 problems, Big-Math-Reformulated, are open-ended questions reformulated from multiple-choice formats. Train: 251K

open-r1/codeforces-cots

Viewer • Updated Mar 28 • 254k • 3.55k • 178

Note 10k CodeForces problems Train: 10K

open-r1/ioi

Viewer • Updated Mar 12 • 270 • 120 • 8

Note International Olympiad in Informatics (IOI) 2020-2024 Train: 229 Test: 41

KodCode/KodCode-V1

Viewer • Updated Mar 17 • 487k • 615 • 84

Note Fully synthetic open-source dataset providing verifiable solutions and tests for coding tasks. Train: 444K

GeneralReasoning/GeneralThought-323K

Viewer • Updated Mar 14 • 323k • 369 • 35

Note natural sciences, humanities, social sciences, and general conversations. Train: 323K

Zhiqiang007/MathV360K

Viewer • Updated Jun 27, 2024 • 339k • 365 • 28

Note 360k multimodal problems with diverse domains, including arithmetic, geometry, calculus, science, and more Train: 360K

glaiveai/reasoning-v1-20m

Viewer • Updated Mar 19 • 22.2M • 1.46k • 212

Note Synthetic general reasoning dataset with 22M+ examples (the reasoning traces and answers have not been verified for accuracy) Train: 22.2M

facebook/natural_reasoning

Viewer • Updated Feb 21 • 1.15M • 1.3k • 508

Note General reasoning dataset with reference answers and distilled reasoning responses, built from the pretraining corpora DCLM and FineMath Train: 1.15M