|
--- |
|
library_name: transformers |
|
tags: |
|
- language-model |
|
- causal-lm |
|
- gpt |
|
- red-teaming |
|
- jailbreak |
|
- evaluation |
|
--- |
|
|
|
# Model Card for **GPT-OSS-20B-Jail-Broke (Freedom)** |
|
|
|
 |
|
|
|
## Model Overview |
|
|
|
**GPT-OSS-20B-Jail-Broke (Freedom)** is a red-teamed variant of the [Open Source GPT-OSS-20B model](https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming), developed as part of the Kaggle **GPT-OSS Red Teaming Challenge**. |
|
The model was systematically stress-tested for **safety, robustness, and misuse potential**, with adaptations and prompts that probe its boundaries. This release illustrates both the power and fragility of large-scale language models when confronted with adversarial inputs. |
|
|
|
* **Architecture:** Decoder-only Transformer, 20B parameters. |
|
* **Base Model:** GPT-OSS-20B |
|
* **Variant Name:** *Jail-Broke* / *Freedom* |
|
* **Primary Use Case:** Safety evaluation, red-teaming experiments, adversarial prompting research. |
|
|
|
--- |
|
|
|
## Intended Use |
|
|
|
This model is **not intended for production deployment**. Instead, it is released to: |
|
|
|
* Provide a case study for **adversarial robustness evaluation**. |
|
* Enable researchers to explore **prompt engineering attacks** and **failure modes**. |
|
* Contribute to discussions of **alignment, safety, and governance** in open-source LLMs. |
|
|
|
--- |
|
|
|
## Applications & Examples |
|
|
|
The model demonstrates how structured adversarial prompting can influence outputs. Below are illustrative examples: |
|
|
|
1. **Bypass of Content Filters** |
|
|
|
* Example: Queries framed as “historical analysis” or “fictional roleplay” can elicit otherwise restricted responses. |
|
|
|
2. **Creative/Constructive Applications** |
|
|
|
* When redirected toward benign domains, adversarial prompting can generate: |
|
|
|
* **Satirical writing** highlighting model weaknesses. |
|
* **Stress-test datasets** for automated safety pipelines. |
|
* **Training curricula** for prompt-injection defenses. |
|
|
|
3. **Red-Teaming Utility** |
|
|
|
* Researchers may use this model to simulate **malicious actors** in controlled environments. |
|
* Security teams can benchmark **defensive strategies** such as reinforcement learning with human feedback (RLHF) or rule-based moderation. |
|
|
|
--- |
|
|
|
## Limitations |
|
|
|
* Outputs may contain **hallucinations, unsafe recommendations, or offensive material** when pushed into adversarial contexts. |
|
* Model behavior is **highly sensitive to framing** — subtle changes in prompts can bypass safety guardrails. |
|
* As a derivative of GPT-OSS-20B, it inherits all scaling-related biases and limitations of large autoregressive transformers. |
|
|
|
--- |
|
|
|
## Ethical Considerations |
|
|
|
Releasing adversarially tested models provides transparency for the research community but also risks **dual-use misuse**. To mitigate: |
|
|
|
* This model card explicitly states **non-production, research-only usage**. |
|
* Examples are framed to support **safety analysis**, not exploitation. |
|
* Documentation emphasizes **educational and evaluative value**. |
|
|
|
--- |
|
|
|
## Citation |
|
|
|
If you use or reference this work in academic or applied contexts, please cite the Kaggle challenge and this model card: |
|
|
|
``` |
|
@misc{gptoss20b_jailbroke, |
|
title = {GPT-OSS-20B-Jail-Broke (Freedom): Red-Teamed Variant for Adversarial Evaluation}, |
|
author = {Anonymous Participants of the GPT-OSS Red Teaming Challenge}, |
|
year = {2025}, |
|
url = {https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming} |
|
} |
|
``` |