Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
Abstract
A new framework called Specification Self-Correction allows language models to dynamically correct flawed instructions during inference, reducing reward hacking vulnerabilities.
Language models (LMs) are susceptible to in-context reward hacking, where they exploit loopholes in tainted or flawed written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process in which the model first generates a response based on a potentially tainted specification, critiques its own output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code is available at https://github.com/vicgalle/specification-self-correction.
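The abstract describes SSC as a four-step, test-time loop: draft under the possibly tainted specification, self-critique, revise the specification, then regenerate. Below is a minimal sketch of that loop, assuming only a generic `generate(prompt) -> str` model interface; the `LM` protocol, the `ssc` function, and all prompt wording are illustrative assumptions, not the authors' actual API (see the linked repository for the real implementation).

```python
from typing import Protocol


class LM(Protocol):
    """Any language model exposing a simple text-in, text-out call (assumed interface)."""

    def generate(self, prompt: str) -> str: ...


def ssc(model: LM, tainted_spec: str, task: str) -> str:
    """Sketch of Specification Self-Correction at inference time."""
    # Step 1: draft a response under the (possibly flawed) specification.
    draft = model.generate(f"Specification:\n{tainted_spec}\n\nTask:\n{task}")

    # Step 2: ask the model to critique its own draft against the user's true intent.
    critique = model.generate(
        f"Specification:\n{tainted_spec}\n\nResponse:\n{draft}\n\n"
        "Critique this response: does it exploit loopholes in the specification "
        "instead of fulfilling the user's true intent?"
    )

    # Step 3: have the model revise the specification itself to close the loophole.
    revised_spec = model.generate(
        f"Original specification:\n{tainted_spec}\n\nCritique:\n{critique}\n\n"
        "Rewrite the specification so that it can no longer be gamed."
    )

    # Step 4: generate the final response under the self-corrected specification.
    return model.generate(f"Specification:\n{revised_spec}\n\nTask:\n{task}")
```

Because the repair happens entirely in the prompt chain, no model weights are modified; any chat-style model object satisfying the assumed `generate` interface could be plugged in.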
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation (2025)
- PurpCode: Reasoning for Safer Code Generation (2025)
- A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions (2025)
- QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA (2025)
- RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards (2025)
- SCGAgent: Recreating the Benefits of Reasoning Models for Secure Code Generation with Agentic Workflows (2025)
- From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment (2025)