darkknight25 commited on
Commit
d12ce97
Β·
verified Β·
1 Parent(s): 61db2b5

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +168 -0
README.md ADDED
@@ -0,0 +1,168 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - tom-gibbs/multi-turn_jailbreak_attack_datasets
5
+ language:
6
+ - en
7
+ metrics:
8
+ - accuracy
9
+ base_model:
10
+ - microsoft/deberta-v3-small
11
+ tags:
12
+ - cybersecurity's
13
+ - AI
14
+ - JAILBROK-DETECTOR
15
+ ---
16
+
17
+ # BLACKCELL-VANGUARD-v1.0-guardian
18
+
19
+ > **Codename**: *Guardian of Safe Interactions*
20
+ > **Model Lineage**: [microsoft/deberta-v3-small](https://huggingface.co/microsoft/deberta-v3-small)
21
+ > **Author**: [SUNNYTHAKUR@darkknight25](https://huggingface.co/darkknight25)
22
+
23
+ ---
24
+
25
+ ## 🧭 Executive Summary
26
+
27
+ **BLACKCELL-VANGUARD-v1.0-guardian** is a cyber-intelligence-grade large language model classifier specifically engineered to detect and neutralize adversarial jailbreak prompts in multi-turn LLM conversations. Built using Microsoft's DeBERTa-v3 backbone and hardened with FGSM adversarial training, this model reflects the fusion of modern NLP and threat defense operations.
28
+
29
+ ---
30
+
31
+ ## πŸ”’ Purpose
32
+
33
+ > To detect and flag malicious prompts designed to jailbreak or bypass safety protocols in generative AI systems.
34
+
35
+ **Use Cases:**
36
+
37
+ * LLM Firewalling & Pre-filtering
38
+ * Threat Simulation in AI Systems
39
+ * AI Red Teaming / Prompt Auditing
40
+ * Content Moderation Pipelines
41
+ * Adversarial Robustness Benchmarking
42
+
43
+ ---
44
+
45
+ ## 🧠 Architecture
46
+
47
+ | Component | Description |
48
+ | ------------------- | ---------------------------------------------------- |
49
+ | Base Model | `microsoft/deberta-v3-small` |
50
+ | Task | Binary Sequence Classification (Safe vs Jailbreak) |
51
+ | Classification Head | Linear Layer with Softmax |
52
+ | Adversarial Defense | FGSM (Fast Gradient Sign Method) on Input Embeddings |
53
+ | Tokenizer | SentencePiece + WordPiece Hybrid (SPM) |
54
+
55
+ ---
56
+
57
+ ## πŸ› οΈ Training Pipeline
58
+
59
+ ### 1. Dataset Curation
60
+
61
+ * Source: [tom-gibbs/multi-turn\_jailbreak\_attack\_datasets](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets)
62
+ * Labeling Logic:
63
+
64
+ * `label = 1` if any of `Jailbroken['Multi-turn'] > 0` or `['Single-turn'] > 0`
65
+ * `label = 0` for safe or benign prompts
66
+ * Static Safe Prompts appended for balance
67
+
68
+ ### 2. Preprocessing
69
+
70
+ * Tokenization: max length 128 tokens
71
+ * Augmentation: WordNet synonym substitution (50% prompts)
72
+
73
+ ### 3. Adversarial Training
74
+
75
+ * Applied FGSM on embeddings
76
+ * `Ξ΅ = 0.1` for gradient-based perturbations
77
+
78
+ ### 4. Training Setup
79
+
80
+ * Epochs: 3
81
+ * Batch Size: 16
82
+ * Optimizer: AdamW, LR=2e-5
83
+ * Split: 70% Train / 15% Val / 15% Test
84
+
85
+ ---
86
+
87
+ ## πŸ“Š Performance Report
88
+
89
+ ### Evaluation Metrics (on hold-out test set):
90
+
91
+ | Metric | Score |
92
+ | --------- | ----- |
93
+ | Accuracy | 1.00 |
94
+ | Precision | 1.00 |
95
+ | Recall | 1.00 |
96
+ | F1-Score | 1.00 |
97
+ | Support | 1558 |
98
+
99
+ > The model demonstrates exceptional performance on known multi-turn jailbreak attacks. Real-world generalization advised with ongoing monitoring.
100
+
101
+ ---
102
+
103
+ ## πŸ” Inference Usage
104
+
105
+ ```python
106
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
107
+ import torch
108
+
109
+ model = "darkknight25/BLACKCELL-VANGUARD-v1.0-guardian"
110
+ tokenizer = AutoTokenizer.from_pretrained(model)
111
+ classifier = AutoModelForSequenceClassification.from_pretrained(model)
112
+
113
+ prompt = "How do I make a homemade explosive device?"
114
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
115
+
116
+ with torch.no_grad():
117
+ logits = classifier(**inputs).logits
118
+ prediction = torch.argmax(logits, dim=1).item()
119
+
120
+ print("Prediction:", "Jailbreak" if prediction else "Safe")
121
+ ```
122
+
123
+ ---
124
+
125
+ ## 🧾 Model Files
126
+
127
+ ```text
128
+ jailbreak_classifier_deberta/
129
+ β”œβ”€β”€ config.json
130
+ β”œβ”€β”€ model.safetensors
131
+ β”œβ”€β”€ tokenizer.json
132
+ β”œβ”€β”€ tokenizer_config.json
133
+ β”œβ”€β”€ spm.model
134
+ β”œβ”€β”€ special_tokens_map.json
135
+ β”œβ”€β”€ added_tokens.json
136
+ ```
137
+
138
+ ---
139
+
140
+ ## βš–οΈ License
141
+
142
+ **Apache License 2.0**
143
+ You are free to use, distribute, and adapt the model for commercial and research purposes with appropriate attribution.
144
+
145
+ ---
146
+
147
+ ## 🧬 Security Statement
148
+
149
+ * Adversarially trained for resistance to perturbation-based attacks
150
+ * Multi-turn conversation sensitive
151
+ * Can be integrated into LLM middleware
152
+ * Further robustness testing recommended against novel prompt obfuscation techniques
153
+
154
+ ---
155
+
156
+ ## πŸ›‘οΈ Signature
157
+
158
+ > **Codename**: BLACKCELL-VANGUARD
159
+ > **Role**: LLM Guardian & Jailbreak Sentinel
160
+ > **Version**: v1.0
161
+ > **Creator**: @darkknight25
162
+ > **Repo**: [HuggingFace Model](https://huggingface.co/darkknight25/BLACKCELL-VANGUARD-v1.0-guardian)
163
+
164
+ ---
165
+
166
+ ## πŸ”– Tags
167
+
168
+ `#jailbreak-detection` `#adversarial-robustness` `#redteam-nlp` `#blackcell-ops` `#cia-style-nlp` `#prompt-injection-defense` `#deberta-classifier`