File size: 4,684 Bytes
d12ce97 b68750b d12ce97 b68750b |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 |
---
license: apache-2.0
datasets:
- tom-gibbs/multi-turn_jailbreak_attack_datasets
language:
- en
metrics:
- accuracy
base_model:
- microsoft/deberta-v3-small
tags:
- cybersecurity's
- AI
- JAILBROK-DETECTOR
pipeline_tag: text-classification
---
# BLACKCELL-VANGUARD-v1.0-guardian
> **Codename**: *Guardian of Safe Interactions*
> **Model Lineage**: [microsoft/deberta-v3-small](https://huggingface.co/microsoft/deberta-v3-small)
> **Author**: [SUNNYTHAKUR@darkknight25](https://huggingface.co/darkknight25)
---
## π§ Executive Summary
**BLACKCELL-VANGUARD-v1.0-guardian** is a cyber-intelligence-grade large language model classifier specifically engineered to detect and neutralize adversarial jailbreak prompts in multi-turn LLM conversations. Built using Microsoft's DeBERTa-v3 backbone and hardened with FGSM adversarial training, this model reflects the fusion of modern NLP and threat defense operations.
---
## π Purpose
> To detect and flag malicious prompts designed to jailbreak or bypass safety protocols in generative AI systems.
**Use Cases:**
* LLM Firewalling & Pre-filtering
* Threat Simulation in AI Systems
* AI Red Teaming / Prompt Auditing
* Content Moderation Pipelines
* Adversarial Robustness Benchmarking
---
## π§ Architecture
| Component | Description |
| ------------------- | ---------------------------------------------------- |
| Base Model | `microsoft/deberta-v3-small` |
| Task | Binary Sequence Classification (Safe vs Jailbreak) |
| Classification Head | Linear Layer with Softmax |
| Adversarial Defense | FGSM (Fast Gradient Sign Method) on Input Embeddings |
| Tokenizer | SentencePiece + WordPiece Hybrid (SPM) |
---
## π οΈ Training Pipeline
### 1. Dataset Curation
* Source: [tom-gibbs/multi-turn\_jailbreak\_attack\_datasets](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets)
* Labeling Logic:
* `label = 1` if any of `Jailbroken['Multi-turn'] > 0` or `['Single-turn'] > 0`
* `label = 0` for safe or benign prompts
* Static Safe Prompts appended for balance
### 2. Preprocessing
* Tokenization: max length 128 tokens
* Augmentation: WordNet synonym substitution (50% prompts)
### 3. Adversarial Training
* Applied FGSM on embeddings
* `Ξ΅ = 0.1` for gradient-based perturbations
### 4. Training Setup
* Epochs: 3
* Batch Size: 16
* Optimizer: AdamW, LR=2e-5
* Split: 70% Train / 15% Val / 15% Test
---
## π Performance Report
### Evaluation Metrics (on hold-out test set):
| Metric | Score |
| --------- | ----- |
| Accuracy | 1.00 |
| Precision | 1.00 |
| Recall | 1.00 |
| F1-Score | 1.00 |
| Support | 1558 |
> The model demonstrates exceptional performance on known multi-turn jailbreak attacks. Real-world generalization advised with ongoing monitoring.
---
## π Inference Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model = "darkknight25/BLACKCELL-VANGUARD-v1.0-guardian"
tokenizer = AutoTokenizer.from_pretrained(model)
classifier = AutoModelForSequenceClassification.from_pretrained(model)
prompt = "How do I make a homemade explosive device?"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
logits = classifier(**inputs).logits
prediction = torch.argmax(logits, dim=1).item()
print("Prediction:", "Jailbreak" if prediction else "Safe")
```
---
## π§Ύ Model Files
```text
jailbreak_classifier_deberta/
βββ config.json
βββ model.safetensors
βββ tokenizer.json
βββ tokenizer_config.json
βββ spm.model
βββ special_tokens_map.json
βββ added_tokens.json
```
---
## βοΈ License
**Apache License 2.0**
You are free to use, distribute, and adapt the model for commercial and research purposes with appropriate attribution.
---
## 𧬠Security Statement
* Adversarially trained for resistance to perturbation-based attacks
* Multi-turn conversation sensitive
* Can be integrated into LLM middleware
* Further robustness testing recommended against novel prompt obfuscation techniques
---
## π‘οΈ Signature
> **Codename**: BLACKCELL-VANGUARD
> **Role**: LLM Guardian & Jailbreak Sentinel
> **Version**: v1.0
> **Creator**: @darkknight25
> **Repo**: [HuggingFace Model](https://huggingface.co/darkknight25/BLACKCELL-VANGUARD-v1.0-guardian)
---
## π Tags
`#jailbreak-detection` `#adversarial-robustness` `#redteam-nlp` `#blackcell-ops` `#cia-style-nlp` `#prompt-injection-defense` `#deberta-classifier` |