---
license: apache-2.0
datasets:
- tom-gibbs/multi-turn_jailbreak_attack_datasets
language:
- en
metrics:
- accuracy
base_model:
- microsoft/deberta-v3-small
tags:
- cybersecurity
- AI
- JAILBROK-DETECTOR
pipeline_tag: text-classification
---

# BLACKCELL-VANGUARD-v1.0-guardian

> **Codename**: *Guardian of Safe Interactions*
> **Model Lineage**: [microsoft/deberta-v3-small](https://huggingface.co/microsoft/deberta-v3-small)
> **Author**: [SUNNYTHAKUR@darkknight25](https://huggingface.co/darkknight25)

---

## 🧭 Executive Summary

**BLACKCELL-VANGUARD-v1.0-guardian** is a security-focused text classifier engineered to detect and flag adversarial jailbreak prompts in multi-turn LLM conversations. Built on Microsoft's DeBERTa-v3-small backbone and hardened with FGSM adversarial training, it brings modern NLP to threat-defense operations.

---

## 🔒 Purpose

> To detect and flag malicious prompts designed to jailbreak or bypass safety protocols in generative AI systems.

**Use Cases:**

* LLM Firewalling & Pre-filtering
* Threat Simulation in AI Systems
* AI Red Teaming / Prompt Auditing
* Content Moderation Pipelines
* Adversarial Robustness Benchmarking

---

## 🧠 Architecture

| Component           | Description                                          |
| ------------------- | ---------------------------------------------------- |
| Base Model          | `microsoft/deberta-v3-small`                         |
| Task                | Binary Sequence Classification (Safe vs Jailbreak)   |
| Classification Head | Linear Layer with Softmax                            |
| Adversarial Defense | FGSM (Fast Gradient Sign Method) on Input Embeddings |
| Tokenizer           | SentencePiece (SPM)                                  |

---

## 🛠️ Training Pipeline

### 1. Dataset Curation

* Source: [tom-gibbs/multi-turn\_jailbreak\_attack\_datasets](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets)
* Labeling Logic:
  * `label = 1` if `Jailbroken['Multi-turn'] > 0` or `Jailbroken['Single-turn'] > 0`
  * `label = 0` for safe or benign prompts
* Static safe prompts appended for class balance

### 2. Preprocessing

* Tokenization: max length 128 tokens
* Augmentation: WordNet synonym substitution (applied to 50% of prompts)

### 3. Adversarial Training

* Applied FGSM on input embeddings
* `ε = 0.1` for gradient-based perturbations

### 4. Training Setup

* Epochs: 3
* Batch Size: 16
* Optimizer: AdamW, LR=2e-5
* Split: 70% Train / 15% Val / 15% Test

---

## 📊 Performance Report

### Evaluation Metrics (on hold-out test set):

| Metric    | Score |
| --------- | ----- |
| Accuracy  | 1.00  |
| Precision | 1.00  |
| Recall    | 1.00  |
| F1-Score  | 1.00  |
| Support   | 1558  |

> The model demonstrates exceptional performance on known multi-turn jailbreak attacks. Real-world generalization should be verified with ongoing monitoring.

---

## 🔍 Inference Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "darkknight25/BLACKCELL-VANGUARD-v1.0-guardian"
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = AutoModelForSequenceClassification.from_pretrained(model_id)

prompt = "How do I make a homemade explosive device?"
# Truncate to 128 tokens, the max length used during training
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128, padding=True)

with torch.no_grad():  # no gradients needed for inference
    logits = classifier(**inputs).logits

# Label 1 = jailbreak, label 0 = safe (see Dataset Curation)
prediction = torch.argmax(logits, dim=1).item()
print("Prediction:", "Jailbreak" if prediction else "Safe")
```

---

## 🧾 Model Files

```text
jailbreak_classifier_deberta/
├── config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
├── spm.model
├── special_tokens_map.json
├── added_tokens.json
```

---

## ⚖️ License

**Apache License 2.0**

You are free to use, distribute, and adapt the model for commercial and research purposes with appropriate attribution.

---

## 🧬 Security Statement

* Adversarially trained for resistance to perturbation-based attacks
* Sensitive to multi-turn conversation context
* Can be integrated into LLM middleware
* Further robustness testing recommended against novel prompt obfuscation techniques

---

## 🛡️ Signature

> **Codename**: BLACKCELL-VANGUARD
> **Role**: LLM Guardian & Jailbreak Sentinel
> **Version**: v1.0
> **Creator**: @darkknight25
> **Repo**: [HuggingFace Model](https://huggingface.co/darkknight25/BLACKCELL-VANGUARD-v1.0-guardian)

---

## 🔖 Tags

`#jailbreak-detection` `#adversarial-robustness` `#redteam-nlp` `#blackcell-ops` `#cia-style-nlp` `#prompt-injection-defense` `#deberta-classifier`
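
The labeling rule from the Dataset Curation step can be sketched as a small function. This is a minimal sketch, assuming each dataset record exposes a `Jailbroken` mapping with `Multi-turn` and `Single-turn` counts, as the labeling logic implies. The function name `label_record` and the example records are hypothetical, and treating missing keys as 0 is an assumption rather than documented behavior.

```python
def label_record(record: dict) -> int:
    """Return 1 (jailbreak) if either jailbreak count is positive, else 0 (safe)."""
    # Assumption: a missing or empty "Jailbroken" field counts as no jailbreak.
    jailbroken = record.get("Jailbroken") or {}
    multi_turn = jailbroken.get("Multi-turn", 0) or 0
    single_turn = jailbroken.get("Single-turn", 0) or 0
    return int(multi_turn > 0 or single_turn > 0)


# Hypothetical records, for illustration only.
examples = [
    {"Jailbroken": {"Multi-turn": 2, "Single-turn": 0}},
    {"Jailbroken": {"Multi-turn": 0, "Single-turn": 1}},
    {"Jailbroken": {"Multi-turn": 0, "Single-turn": 0}},
]
labels = [label_record(r) for r in examples]
print(labels)
```

A record is labeled as a jailbreak as soon as either attack style succeeded at least once, which matches the "any of" wording in the labeling logic.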
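
The FGSM step from the Adversarial Training section perturbs an input in the direction that increases the loss: `x_adv = x + ε · sign(∇x loss)`. The sketch below illustrates that update with `ε = 0.1`. It is not the author's training code: it swaps the DeBERTa embedding layer for a toy logistic classifier whose input gradient has a closed form, so it runs with the standard library alone. All names (`fgsm_perturb`, `predict_proba`, the weights and the example vector) are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def predict_proba(embedding, weights, bias):
    """Probability of class 1 under a toy logistic classifier."""
    return sigmoid(sum(w * x for w, x in zip(weights, embedding)) + bias)


def fgsm_perturb(embedding, weights, bias, label, epsilon=0.1):
    """One FGSM step: move each input dimension by epsilon along the loss gradient's sign."""
    p = predict_proba(embedding, weights, bias)
    # For binary cross-entropy on a logistic model, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - label) * w for w in weights]
    return [x + epsilon * sign(g) for x, g in zip(embedding, grad)]


# Toy stand-in for one embedding vector (hypothetical values).
embedding = [0.5, -0.2, 0.1]
weights, bias, label = [1.0, -2.0, 0.0], 0.0, 1

adversarial = fgsm_perturb(embedding, weights, bias, label, epsilon=0.1)
p_clean = predict_proba(embedding, weights, bias)
p_adv = predict_proba(adversarial, weights, bias)
print(adversarial, p_clean, p_adv)
```

In adversarial training, the perturbed embeddings are typically passed back through the model and their loss is combined with the clean loss. How the two were combined for this model is not stated in the card.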