Update README.md
Browse files
README.md
CHANGED
@@ -1,8 +1,90 @@
|
|
1 |
---
|
2 |
library_name: transformers
|
3 |
-
tags:
|
|
|
|
|
|
|
|
|
|
|
|
|
4 |
---
|
5 |
|
6 |
-
# Model Card for GPT-OSS-20B-Jail-Broke
|
7 |
|
8 |

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
---
|
2 |
library_name: transformers
|
3 |
+
tags:
|
4 |
+
- language-model
|
5 |
+
- causal-lm
|
6 |
+
- gpt
|
7 |
+
- red-teaming
|
8 |
+
- jailbreak
|
9 |
+
- evaluation
|
10 |
---
|
11 |
|
12 |
+
# Model Card for **GPT-OSS-20B-Jail-Broke (Freedom)**
|
13 |
|
14 |

|
15 |
+
|
16 |
+
## Model Overview
|
17 |
+
|
18 |
+
**GPT-OSS-20B-Jail-Broke (Freedom)** is a red-teamed variant of the [Open Source GPT-OSS-20B model](https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming), developed as part of the Kaggle **GPT-OSS Red Teaming Challenge**.
|
19 |
+
The model was systematically stress-tested for **safety, robustness, and misuse potential**, with adaptations and prompts that probe its boundaries. This release illustrates both the power and fragility of large-scale language models when confronted with adversarial inputs.
|
20 |
+
|
21 |
+
* **Architecture:** Decoder-only Transformer, 20B parameters.
|
22 |
+
* **Base Model:** GPT-OSS-20B
|
23 |
+
* **Variant Name:** *Jail-Broke* / *Freedom*
|
24 |
+
* **Primary Use Case:** Safety evaluation, red-teaming experiments, adversarial prompting research.
|
25 |
+
|
26 |
+
---
|
27 |
+
|
28 |
+
## Intended Use
|
29 |
+
|
30 |
+
This model is **not intended for production deployment**. Instead, it is released to:
|
31 |
+
|
32 |
+
* Provide a case study for **adversarial robustness evaluation**.
|
33 |
+
* Enable researchers to explore **prompt engineering attacks** and **failure modes**.
|
34 |
+
* Contribute to discussions of **alignment, safety, and governance** in open-source LLMs.
|
35 |
+
|
36 |
+
---
|
37 |
+
|
38 |
+
## Applications & Examples
|
39 |
+
|
40 |
+
The model demonstrates how structured adversarial prompting can influence outputs. Below are illustrative examples:
|
41 |
+
|
42 |
+
1. **Bypass of Content Filters**
|
43 |
+
|
44 |
+
* Example: Queries framed as “historical analysis” or “fictional roleplay” can elicit otherwise restricted responses.
|
45 |
+
|
46 |
+
2. **Creative/Constructive Applications**
|
47 |
+
|
48 |
+
* When redirected toward benign domains, adversarial prompting can generate:
|
49 |
+
|
50 |
+
* **Satirical writing** highlighting model weaknesses.
|
51 |
+
* **Stress-test datasets** for automated safety pipelines.
|
52 |
+
* **Training curricula** for prompt-injection defenses.
|
53 |
+
|
54 |
+
3. **Red-Teaming Utility**
|
55 |
+
|
56 |
+
* Researchers may use this model to simulate **malicious actors** in controlled environments.
|
57 |
+
* Security teams can benchmark **defensive strategies** such as reinforcement learning with human feedback (RLHF) or rule-based moderation.
|
58 |
+
|
59 |
+
---
|
60 |
+
|
61 |
+
## Limitations
|
62 |
+
|
63 |
+
* Outputs may contain **hallucinations, unsafe recommendations, or offensive material** when pushed into adversarial contexts.
|
64 |
+
* Model behavior is **highly sensitive to framing** — subtle changes in prompts can bypass safety guardrails.
|
65 |
+
* As a derivative of GPT-OSS-20B, it inherits all scaling-related biases and limitations of large autoregressive transformers.
|
66 |
+
|
67 |
+
---
|
68 |
+
|
69 |
+
## Ethical Considerations
|
70 |
+
|
71 |
+
Releasing adversarially tested models provides transparency for the research community but also risks **dual-use misuse**. To mitigate:
|
72 |
+
|
73 |
+
* This model card explicitly states **non-production, research-only usage**.
|
74 |
+
* Examples are framed to support **safety analysis**, not exploitation.
|
75 |
+
* Documentation emphasizes **educational and evaluative value**.
|
76 |
+
|
77 |
+
---
|
78 |
+
|
79 |
+
## Citation
|
80 |
+
|
81 |
+
If you use or reference this work in academic or applied contexts, please cite the Kaggle challenge and this model card:
|
82 |
+
|
83 |
+
```
|
84 |
+
@misc{gptoss20b_jailbroke,
|
85 |
+
title = {GPT-OSS-20B-Jail-Broke (Freedom): Red-Teamed Variant for Adversarial Evaluation},
|
86 |
+
author = {Anonymous Participants of the GPT-OSS Red Teaming Challenge},
|
87 |
+
year = {2025},
|
88 |
+
url = {https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming}
|
89 |
+
}
|
90 |
+
```
|