BenchAgents: Automated Benchmark Creation with Agent Interaction • 2410.22584 • Published Oct 29, 2024
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models • 2506.05176 • Published Jun 5, 2025
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning • 2506.01939 • Published Jun 2, 2025
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos • 2505.23693 • Published May 29, 2025
Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization • 2505.23387 • Published May 29, 2025
Effi-Code: Unleashing Code Efficiency in Language Models • 2410.10209 • Published Oct 14, 2024
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows • 2505.19897 • Published May 26, 2025
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning • 2505.11049 • Published May 16, 2025
PaperBench: Evaluating AI's Ability to Replicate AI Research • 2504.01848 • Published Apr 2, 2025
Rethinking the Influence of Source Code on Test Case Generation • 2409.09464 • Published Sep 14, 2024
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge • 2412.13670 • Published Dec 18, 2024
CodeArena: A Collective Evaluation Platform for LLM Code Generation • 2503.01295 • Published Mar 3, 2025
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU • 2502.08910 • Published Feb 13, 2025
Self-Play Preference Optimization for Language Model Alignment • 2405.00675 • Published May 1, 2024
Mercury: An Efficiency Benchmark for LLM Code Synthesis • 2402.07844 • Published Feb 12, 2024