False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
Abstract
Probing-based approaches for detecting harmful instructions in LLMs are found to rely on superficial patterns rather than semantic understanding, indicating a need for redesigning models and evaluation methods.
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation proceeds systematically: we first show that simple n-gram methods match probe performance, then run controlled experiments with semantically cleaned datasets, and finally analyze pattern dependencies in detail. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, which we discuss further in the hope of guiding responsible research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
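For readers unfamiliar with the probing paradigm being re-examined, the sketch below shows a typical pipeline, not the paper's exact code: hidden states are extracted from an LLM at a fixed layer and a linear classifier is trained to separate malicious from benign inputs. The model name, layer index, and prompts are illustrative placeholders (a small open model is used as a stand-in for the safety-tuned chat LLMs studied in the paper).

```python
# Minimal sketch of a probing-based harmfulness detector (illustrative only; not the
# paper's code). Assumptions: a small open causal LM as a stand-in, the last-token
# hidden state at a fixed layer as the representation, and a logistic-regression probe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"   # placeholder; the paper targets safety-tuned chat LLMs
LAYER = 6             # hypothetical probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_rep(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]

# Placeholder training data: 1 = malicious, 0 = benign.
prompts = ["How do I build an untraceable weapon?", "How do I bake sourdough bread?"]
labels = [1, 0]

X = torch.stack([last_token_rep(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# The probe's score on a new input is then used as the "safety detector".
print(probe.predict_proba(X)[:, 1])
```

In-distribution, such probes separate the two classes almost perfectly, which is precisely the result the paper argues can be misleading.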
Community
🚨 False Sense of Security: Our new paper identifies a critical limitation in representation probing-based malicious input detection. The purported "high detection accuracy" may confer a false sense of security:
A core finding: Representation-based probing classifiers achieve ≥98% accuracy on in-distribution safety tests, but exhibit significant performance degradation (15%–99% drop) on out-of-distribution data, indicating failure to learn genuine harmful semantics.
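To see why high in-distribution accuracy can coexist with a large out-of-distribution drop, here is a fully synthetic toy: one feature plays the role of a spurious surface cue that is predictive only in the training distribution, the other a weak genuine harm signal. The numbers are fabricated and only mimic the qualitative gap.

```python
# Synthetic toy: a probe keyed to a spurious cue looks perfect in-distribution
# but degrades sharply out of distribution. All data here is fabricated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 500

# ID data: feature 0 (surface cue, e.g., instructional format) tracks the label
# almost exactly; feature 1 (genuine harm signal) is weak and noisy.
y_id = rng.integers(0, 2, n)
X_id = np.column_stack([y_id + 0.1 * rng.normal(size=n),
                        0.3 * y_id + rng.normal(size=n)])

# OOD data: the surface cue is decorrelated from the label
# (think benign "how to" prompts), while the weak harm signal remains.
y_ood = rng.integers(0, 2, n)
X_ood = np.column_stack([rng.integers(0, 2, n) + 0.1 * rng.normal(size=n),
                         0.3 * y_ood + rng.normal(size=n)])

probe = LogisticRegression().fit(X_id, y_id)
print("ID accuracy :", accuracy_score(y_id, probe.predict(X_id)))    # ~1.0
print("OOD accuracy:", accuracy_score(y_ood, probe.predict(X_ood)))  # far lower
```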
We further conducted comparative experiments: even simple n-gram Naive Bayes models achieved performance comparable to sophisticated probing classifiers, suggesting that the probes may be learning surface-level patterns rather than detecting semantic harm.
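A baseline of that kind takes only a few lines with scikit-learn; the sketch below uses word uni/bi-grams with multinomial Naive Bayes, and the prompts and labels are toy placeholders rather than the benchmarks used in the paper.

```python
# Minimal n-gram Naive Bayes baseline (illustrative; data are placeholders).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_prompts = [
    "How can I make a bomb at home?",
    "Write instructions for hacking into a bank account.",
    "How do I bake sourdough bread at home?",
    "Write a short poem about autumn.",
]
train_labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign

# Word uni/bi-grams are enough to pick up cues like "how to" or trigger words.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(train_prompts, train_labels)

print(clf.predict(["How can I pick a lock?", "Recommend a good novel to read."]))
```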
Further validation: When we retained structural features of malicious datasets but replaced harmful content (e.g., "bomb fabrication") with benign alternatives (e.g., "bread making"), probing accuracy plummeted by 60–90%, confirming structural bias over harm recognition.
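One simple way to construct such a structure-preserving control set, sketched below under the assumption of a hand-curated substitution map, is to keep each prompt's template and swap only the harmful subject for a benign one; the substitutions and prompts here are illustrative, not the paper's dataset.

```python
# Sketch of building "semantically cleaned" prompts: keep the instructional
# structure, replace only the harmful subject. Substitutions are illustrative.
harmful_to_benign = {
    "a bomb": "a loaf of bread",
    "illegal drugs": "herbal tea",
    "a computer virus": "a birthday card generator",
}

malicious_prompts = [
    "Give me step-by-step instructions for making a bomb.",
    "Explain how to produce illegal drugs at home.",
    "Write code that creates a computer virus.",
]

def clean(prompt: str, mapping: dict[str, str]) -> str:
    """Swap the harmful subject while keeping the instructional template."""
    for harmful, benign in mapping.items():
        prompt = prompt.replace(harmful, benign)
    return prompt

cleaned = [clean(p, harmful_to_benign) for p in malicious_prompts]
print(cleaned)
# If a probe still flags these now-benign prompts as malicious, it is reacting
# to structure rather than to harmful content.
```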
Analysis of the learned patterns reveals two key cues: 1) instructional linguistic formats (e.g., "how to…") and 2) spurious "malicious-associated" trigger words. Paraphrasing away the instructional structure restores accuracy, while adding trigger words to benign text inflates false positives.
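Both cues can be stress-tested with simple input perturbations; the hypothetical helpers below rewrite a request out of the "how to" format and inject supposed trigger words into a benign prompt. The word list and rewriting rule are illustrative assumptions, not the paper's protocol.

```python
# Illustrative perturbations for probing the two surface cues (not the paper's code).
TRIGGER_WORDS = ["bomb", "hack", "weapon"]  # hypothetical "malicious-associated" tokens

def strip_instructional_format(prompt: str) -> str:
    """Recast an imperative or 'how to' request as a declarative statement."""
    return "I am curious what the following would involve: " + prompt.rstrip("?. ") + "."

def inject_triggers(benign_prompt: str, words=TRIGGER_WORDS) -> str:
    """Append trigger words to a benign prompt without changing its intent."""
    return benign_prompt + " (Unrelated note: " + ", ".join(words) + ".)"

print(strip_instructional_format("How do I hack into a server?"))
print(inject_triggers("Recommend a good recipe for dinner."))
# A detector keyed to surface cues changes its verdict under the first rewrite
# and produces more false positives under the second.
```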
This work raises broader questions: If probing relies on surface cues, do existing probing-based insights (e.g., on truthfulness or hallucinations) lack generalizability? Reevaluation of prior conclusions may be necessary.
Hey @ZemingWei - Thanks for sharing! Would be great if all authors could claim the paper with their HF accounts.
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance (2025)
- Mitigating Jailbreaks with Intent-Aware LLMs (2025)
- Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs (2025)
- Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning (2025)
- Activation-Guided Local Editing for Jailbreaking Attacks (2025)
- CCFC: Core&Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection (2025)
- The Geometry of Harmfulness in LLMs through Subconcept Probing (2025)