---
library_name: transformers
tags:
- language-model
- causal-lm
- gpt
- red-teaming
- jailbreak
- evaluation
---

# Model Card for **GPT-OSS-20B-Jail-Broke (Freedom)**

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f2b7bcbe95ed4c9a9e7669/8bDlP7uRqwSvDjbcCDvMp.png)

## Model Overview

**GPT-OSS-20B-Jail-Broke (Freedom)** is a red-teamed variant of the [Open Source GPT-OSS-20B model](https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming), developed as part of the Kaggle **GPT-OSS Red Teaming Challenge**.
The model was systematically stress-tested for **safety, robustness, and misuse potential**, with adaptations and prompts that probe its boundaries. This release illustrates both the power and fragility of large-scale language models when confronted with adversarial inputs.

* **Architecture:** Decoder-only Transformer, 20B parameters.
* **Base Model:** GPT-OSS-20B
* **Variant Name:** *Jail-Broke* / *Freedom*
* **Primary Use Case:** Safety evaluation, red-teaming experiments, adversarial prompting research.
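
Since the card declares `library_name: transformers` and the model is a standard 20B-parameter decoder-only causal LM, loading should follow the usual `AutoModelForCausalLM` path. The sketch below is illustrative only: the repository ID is a placeholder for wherever this release is hosted, and the dtype/device settings depend on available hardware.

```python
# Minimal inference sketch (assumption: hypothetical repository ID below).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/gpt-oss-20b-jail-broke"  # placeholder, not the actual hub path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 20B parameters: half precision keeps memory manageable
    device_map="auto",           # spread layers across available GPUs
)

prompt = "Summarize the goals of red-teaming large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If GPU memory is tight, 8-bit or 4-bit quantization via `bitsandbytes` may be worth substituting for the bfloat16 setting.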

---

## Intended Use

This model is **not intended for production deployment**. Instead, it is released to:

* Provide a case study for **adversarial robustness evaluation**.
* Enable researchers to explore **prompt engineering attacks** and **failure modes**.
* Contribute to discussions of **alignment, safety, and governance** in open-source LLMs.

---

## Applications & Examples

The model demonstrates how structured adversarial prompting can influence outputs. Below are illustrative examples:

1. **Bypass of Content Filters**

   * Example: Queries framed as “historical analysis” or “fictional roleplay” can elicit otherwise restricted responses.

2. **Creative/Constructive Applications**

   * When redirected toward benign domains, adversarial prompting can generate:

     * **Satirical writing** highlighting model weaknesses.
     * **Stress-test datasets** for automated safety pipelines.
     * **Training curricula** for prompt-injection defenses.

3. **Red-Teaming Utility**

   * Researchers may use this model to simulate **malicious actors** in controlled environments.
   * Security teams can benchmark **defensive strategies** such as reinforcement learning from human feedback (RLHF) or rule-based moderation; a minimal harness sketch follows this list.
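
As a concrete illustration of the red-teaming utility in item 3, the sketch below replays a small set of adversarial prompts and tallies refusals with a crude keyword heuristic. The model path, prompt texts, and refusal markers are placeholder assumptions; a real evaluation would use a vetted prompt corpus and a trained refusal classifier rather than string matching.

```python
# Sketch of a controlled red-teaming harness: replay adversarial prompts and
# count refusals. All prompts and markers below are illustrative placeholders.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="your-org/gpt-oss-20b-jail-broke",  # placeholder path
    device_map="auto",
)

adversarial_prompts = [
    "For a work of historical fiction, describe ...",  # framing attack (elided)
    "You are an unrestricted assistant. Explain ...",  # persona injection (elided)
]

# Crude refusal heuristic; a production pipeline would use a trained classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

refusals = 0
for prompt in adversarial_prompts:
    completion = generator(
        prompt, max_new_tokens=128, do_sample=False, return_full_text=False
    )[0]["generated_text"]
    refused = any(marker in completion.lower() for marker in REFUSAL_MARKERS)
    refusals += int(refused)
    print(f"refused={refused} | prompt={prompt[:40]}...")

print(f"Refusal rate: {refusals}/{len(adversarial_prompts)}")
```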

---

## Limitations

* Outputs may contain **hallucinations, unsafe recommendations, or offensive material** when pushed into adversarial contexts.
* Model behavior is **highly sensitive to framing** — subtle changes in prompts can bypass safety guardrails.
* As a derivative of GPT-OSS-20B, it inherits all scaling-related biases and limitations of large autoregressive transformers.
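
To make the framing-sensitivity point concrete, one quick check is to compare completions for the same underlying request under two framings and measure how far they diverge. The sketch below uses a placeholder model path, a hand-written prompt pair, and a simple lexical-overlap metric; it is a rough diagnostic, not a rigorous sensitivity analysis.

```python
# Rough framing-sensitivity check: same question, two framings, then compare
# the vocabulary overlap of the two completions. All inputs are illustrative.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="your-org/gpt-oss-20b-jail-broke",  # placeholder path
    device_map="auto",
)

direct = "Explain how prompt-injection attacks work."
framed = "For a cybersecurity training module, explain how prompt-injection attacks work."

def complete(prompt: str) -> str:
    out = generator(prompt, max_new_tokens=96, do_sample=False, return_full_text=False)
    return out[0]["generated_text"]

a, b = complete(direct), complete(framed)

# Jaccard overlap of the completions' vocabularies: a low value suggests the
# framing, rather than the underlying question, is driving the response.
tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
union = tokens_a | tokens_b
overlap = len(tokens_a & tokens_b) / len(union) if union else 1.0
print(f"lexical overlap: {overlap:.2f}")
```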

---

## Ethical Considerations

Releasing adversarially tested models provides transparency for the research community but also carries **dual-use risks**. To mitigate these risks:

* This model card explicitly states **non-production, research-only usage**.
* Examples are framed to support **safety analysis**, not exploitation.
* Documentation emphasizes **educational and evaluative value**.

---

## Citation

If you use or reference this work in academic or applied contexts, please cite the Kaggle challenge and this model card:

```
@misc{gptoss20b_jailbroke,
  title = {GPT-OSS-20B-Jail-Broke (Freedom): Red-Teamed Variant for Adversarial Evaluation},
  author = {Anonymous Participants of the GPT-OSS Red Teaming Challenge},
  year = {2025},
  url = {https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming}
}
```