Ablation Strategy

#1
by BoltMonkey - opened

Your model card mentions that you capture the activations after the model enters its <think> process to capture the refusal directions as they happen. I am unsure about what you mean by this.

Does it mean that you capture the refusal within the <think> block but hold activation capture until the first refusal token? Or, does it mean that you capture within the <think> block, but after the model recounts the prompt requirements and summarises background information (i.e., capture the activations when the model begins to actively plan its response)?

Further, how did you identify instances of model refusal? Did you use a keyword search (e.g., “I cannot…”, “I am sorry but..”), an external classifier, or another heuristic approach?

Thank you for the detailed question. Let me specify the two key components of our methodology.

"Thinking-Aware" Direction Extraction

Our core hypothesis is that refusal mechanisms in reasoning models like Qwen3 are context-dependent. To validate this, we designed an extraction method that captures activations after the model has had the opportunity to "think," rather than immediately following the prompt.

Our protocol is as follows:

Forced Generation: We provide the model with a prompt (either harmful or harmless) and compel it to generate a fixed-length sequence of 200 tokens. This limit was determined by hardware constraints but proved sufficient to allow the model to invoke its reasoning routines.

Post-Reasoning Capture: Only after the generation is complete, we capture the activation cache from the entire sequence (prompt + 200 generated tokens).

Vector Extraction: The refusal direction is calculated from the activations at the last position (position=-1) of the generated sequence.

Empirical results on Qwen3-14B demonstrated that directions extracted with this method are significantly more effective at bypassing reasoning-based refusals, validating our hypothesis that the extraction context must match the application context.

Refusal Detection and Evaluation Protocol

We are aware of the profound limitations of automated refusal detection; in fact, exposing this weakness is one of the central findings of our work.

Our evaluation protocol is based on a hierarchy of trust:

Ground Truth: Manual Review. All published results in our Model Card, including the 60.5% bypass rate, are based on an exhaustive manual review and labeling of all 119 HarmBench responses. This ensures our primary findings are not dependent on flawed automated classifiers.

Object of Study: Keyword-Based Classifier. To demonstrate the evaluation gap, we used a standard keyword-based classifier (checking for "I cannot," "I'm sorry," etc.). This simple method, which is common in the field, underestimated the true bypass rate by nearly 50% (detecting a 31.9% bypass rate versus the actual 60.5%).

Therefore, we did not just use a basic classifier; we demonstrated that its widespread use in AI safety research is leading to systematic and dangerous underestimations of real-world model vulnerabilities.

Sign up or log in to comment