Abstract
Putnam-AXIOM and its variation set provide a contamination-resilient benchmark for evaluating advanced mathematical reasoning in large language models, with accuracy drops on unseen variations revealing memorization.
Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural-language proof evaluation. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
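The abstract describes generating functional variants by programmatically perturbing variables and constants while recomputing the ground-truth answer. The snippet below is a minimal sketch of what such a variation generator might look like; the template structure, field names, and the toy problem are illustrative assumptions, not the authors' actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class FunctionalVariation:
    """A problem template whose constants can be re-sampled.

    Hypothetical structure: the real Putnam-AXIOM variation protocol may
    differ; this only illustrates perturbing constants while recomputing
    the boxed answer programmatically.
    """
    template: str                               # problem statement with placeholders
    sample: Callable[[random.Random], dict]     # draws new constants
    solve: Callable[[dict], str]                # recomputes the ground-truth answer

    def instantiate(self, seed: int) -> tuple[str, str]:
        rng = random.Random(seed)
        params = self.sample(rng)
        return self.template.format(**params), self.solve(params)

# Toy example (not an actual Putnam problem): perturb the constant k in
# "Find the remainder when k^100 is divided by 7."
variation = FunctionalVariation(
    template="Find the remainder when {k}^100 is divided by 7.",
    sample=lambda rng: {"k": rng.randint(2, 50)},
    solve=lambda p: str(pow(p["k"], 100, 7)),
)

for seed in range(3):
    problem, answer = variation.instantiate(seed)
    print(problem, "->", answer)
```

Because each seed yields a fresh, equally difficult instance with a verifiable answer, a generator of this shape can emit an effectively unlimited stream of unseen test items.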
Community
We noticed that many math reasoning benchmarks for LLMs are either saturated or vulnerable to contamination, making it hard to tell whether new models are actually reasoning better.
Putnam-AXIOM (arXiv:2508.08292) introduces:
- 522 original Putnam problems (1959–2023)
- 100 functional variations to test robustness & contamination resistance
- Teacher-Forced Accuracy (TFA) to evaluate reasoning steps, not just answers (see the sketch below)
Question: Do you think contamination-resistant variations like this could become a standard in benchmark design? Why or why not?
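Since TFA is described as directly scoring reasoning traces rather than only the final boxed answer, here is a rough sketch of one way a teacher-forced score could be computed with Hugging Face `transformers`. The exact definition in the paper may differ (e.g. in tokenization, prompt masking, or aggregation), and the checkpoint name is only a stand-in.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def teacher_forced_accuracy(model, tokenizer, problem: str, reference_solution: str) -> float:
    """Fraction of reference-solution tokens the model predicts correctly
    under teacher forcing (greedy next-token match), conditioned on the problem.

    Assumed formulation for illustration; the paper's TFA may normalize
    or aggregate differently.
    """
    prompt_ids = tokenizer(problem, return_tensors="pt").input_ids
    solution_ids = tokenizer(reference_solution, return_tensors="pt",
                             add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, solution_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab)

    # Logits at position t predict the token at position t + 1, so the
    # predictions for the solution span start one position before it.
    start = prompt_ids.shape[1]
    preds = logits[0, start - 1 : input_ids.shape[1] - 1].argmax(dim=-1)
    targets = input_ids[0, start:]
    return (preds == targets).float().mean().item()

# Usage with any causal LM checkpoint ("gpt2" is just a placeholder):
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
score = teacher_forced_accuracy(model, tokenizer, "Problem: 2+2=?\nSolution:", " 2+2=4.")
print(f"TFA: {score:.3f}")
```

A trace-level score like this rewards models whose step-by-step reasoning matches a reference solution, which is what makes it usable as an automated proxy for natural-language proof evaluation.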
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025)
- An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems (2025)
- LastingBench: Defend Benchmarks Against Knowledge Leakage (2025)
- ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems (2025)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward (2025)
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination (2025)
- INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems (2025)