Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
Abstract
Adversarial attacks using Direct Preference Optimization fine-tune language models to evade detection, leading to a significant drop in the performance of existing MGT detectors.
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
Community
[ACL 2025 Findings] Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It) (2025)
- Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors (2025)
- Could AI Trace and Explain the Origins of AI-Generated Images and Text? (2025)
- FAID: Fine-grained AI-generated Text Detection using Multi-task Auxiliary and Multi-level Contrastive Learning (2025)
- Robust and Fine-Grained Detection of AI Generated Texts (2025)
- Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability (2025)
- OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper