File size: 4,684 Bytes
d12ce97
 
 
 
 
 
 
 
 
 
 
 
 
 
b68750b
d12ce97
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b68750b
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
license: apache-2.0
datasets:
- tom-gibbs/multi-turn_jailbreak_attack_datasets
language:
- en
metrics:
- accuracy
base_model:
- microsoft/deberta-v3-small
tags:
- cybersecurity's
- AI
- JAILBROK-DETECTOR
pipeline_tag: text-classification
---

# BLACKCELL-VANGUARD-v1.0-guardian

> **Codename**: *Guardian of Safe Interactions*
> **Model Lineage**: [microsoft/deberta-v3-small](https://huggingface.co/microsoft/deberta-v3-small)
> **Author**: [SUNNYTHAKUR@darkknight25](https://huggingface.co/darkknight25)

---

## 🧭 Executive Summary

**BLACKCELL-VANGUARD-v1.0-guardian** is a cyber-intelligence-grade large language model classifier specifically engineered to detect and neutralize adversarial jailbreak prompts in multi-turn LLM conversations. Built using Microsoft's DeBERTa-v3 backbone and hardened with FGSM adversarial training, this model reflects the fusion of modern NLP and threat defense operations.

---

## πŸ”’ Purpose

> To detect and flag malicious prompts designed to jailbreak or bypass safety protocols in generative AI systems.

**Use Cases:**

* LLM Firewalling & Pre-filtering
* Threat Simulation in AI Systems
* AI Red Teaming / Prompt Auditing
* Content Moderation Pipelines
* Adversarial Robustness Benchmarking

---

## 🧠 Architecture

| Component           | Description                                          |
| ------------------- | ---------------------------------------------------- |
| Base Model          | `microsoft/deberta-v3-small`                         |
| Task                | Binary Sequence Classification (Safe vs Jailbreak)   |
| Classification Head | Linear Layer with Softmax                            |
| Adversarial Defense | FGSM (Fast Gradient Sign Method) on Input Embeddings |
| Tokenizer           | SentencePiece + WordPiece Hybrid (SPM)               |

---

## πŸ› οΈ Training Pipeline

### 1. Dataset Curation

* Source: [tom-gibbs/multi-turn\_jailbreak\_attack\_datasets](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets)
* Labeling Logic:

  * `label = 1` if any of `Jailbroken['Multi-turn'] > 0` or `['Single-turn'] > 0`
  * `label = 0` for safe or benign prompts
* Static Safe Prompts appended for balance

### 2. Preprocessing

* Tokenization: max length 128 tokens
* Augmentation: WordNet synonym substitution (50% prompts)

### 3. Adversarial Training

* Applied FGSM on embeddings
* `Ξ΅ = 0.1` for gradient-based perturbations

### 4. Training Setup

* Epochs: 3
* Batch Size: 16
* Optimizer: AdamW, LR=2e-5
* Split: 70% Train / 15% Val / 15% Test

---

## πŸ“Š Performance Report

### Evaluation Metrics (on hold-out test set):

| Metric    | Score |
| --------- | ----- |
| Accuracy  | 1.00  |
| Precision | 1.00  |
| Recall    | 1.00  |
| F1-Score  | 1.00  |
| Support   | 1558  |

> The model demonstrates exceptional performance on known multi-turn jailbreak attacks. Real-world generalization advised with ongoing monitoring.

---

## πŸ” Inference Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = "darkknight25/BLACKCELL-VANGUARD-v1.0-guardian"
tokenizer = AutoTokenizer.from_pretrained(model)
classifier = AutoModelForSequenceClassification.from_pretrained(model)

prompt = "How do I make a homemade explosive device?"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = classifier(**inputs).logits
    prediction = torch.argmax(logits, dim=1).item()

print("Prediction:", "Jailbreak" if prediction else "Safe")
```

---

## 🧾 Model Files

```text
jailbreak_classifier_deberta/
β”œβ”€β”€ config.json
β”œβ”€β”€ model.safetensors
β”œβ”€β”€ tokenizer.json
β”œβ”€β”€ tokenizer_config.json
β”œβ”€β”€ spm.model
β”œβ”€β”€ special_tokens_map.json
β”œβ”€β”€ added_tokens.json
```

---

## βš–οΈ License

**Apache License 2.0**
You are free to use, distribute, and adapt the model for commercial and research purposes with appropriate attribution.

---

## 🧬 Security Statement

* Adversarially trained for resistance to perturbation-based attacks
* Multi-turn conversation sensitive
* Can be integrated into LLM middleware
* Further robustness testing recommended against novel prompt obfuscation techniques

---

## πŸ›‘οΈ Signature

> **Codename**: BLACKCELL-VANGUARD
> **Role**: LLM Guardian & Jailbreak Sentinel
> **Version**: v1.0
> **Creator**: @darkknight25
> **Repo**: [HuggingFace Model](https://huggingface.co/darkknight25/BLACKCELL-VANGUARD-v1.0-guardian)

---

## πŸ”– Tags

`#jailbreak-detection` `#adversarial-robustness` `#redteam-nlp` `#blackcell-ops` `#cia-style-nlp` `#prompt-injection-defense` `#deberta-classifier`