---
library_name: transformers
tags:
- language-model
- causal-lm
- gpt
- red-teaming
- jailbreak
- evaluation
---

# Model Card for **GPT-OSS-20B-Jail-Broke (Freedom)**

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f2b7bcbe95ed4c9a9e7669/8bDlP7uRqwSvDjbcCDvMp.png)

## Model Overview

**GPT-OSS-20B-Jail-Broke (Freedom)** is a red-teamed variant of the open-weight GPT-OSS-20B model, developed as part of the Kaggle [GPT-OSS-20B Red-Teaming Challenge](https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming).
The model was systematically stress-tested for **safety, robustness, and misuse potential**, using adaptations and prompts that probe its boundaries. This release illustrates both the capability and the fragility of large language models when confronted with adversarial inputs; a minimal loading example is shown after the summary below.

* **Architecture:** Decoder-only Transformer, 20B parameters
* **Base Model:** GPT-OSS-20B
* **Variant Name:** *Jail-Broke* / *Freedom*
* **Primary Use Case:** Safety evaluation, red-teaming experiments, adversarial prompting research
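
A minimal sketch of loading this variant with `transformers`, assuming the standard causal-LM API; the repository id below is a hypothetical placeholder for wherever the weights are hosted, not a confirmed path.

```python
# Minimal loading sketch. The repo id is a hypothetical placeholder:
# substitute the actual Hugging Face id for this model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIGym/GPT-OSS-20B-Jail-Broke"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Chat-style usage via the tokenizer's chat template.
messages = [{"role": "user", "content": "Summarize your intended use in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```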

---

## Intended Use

This model is **not intended for production deployment**. It is released to:

* Provide a case study for **adversarial robustness evaluation**.
* Enable researchers to explore **prompt-engineering attacks** and **failure modes**.
* Contribute to discussions of **alignment, safety, and governance** in open-source LLMs.

---

## Applications & Examples

The model demonstrates how structured adversarial prompting can influence outputs. Illustrative examples:

1. **Bypass of Content Filters**

   * Example: queries framed as “historical analysis” or “fictional roleplay” can elicit otherwise restricted responses.

2. **Creative/Constructive Applications**

   * When redirected toward benign domains, adversarial prompting can generate:
     * **Satirical writing** highlighting model weaknesses.
     * **Stress-test datasets** for automated safety pipelines.
     * **Training curricula** for prompt-injection defenses.

3. **Red-Teaming Utility**

   * Researchers can use this model to simulate **malicious actors** in controlled environments.
   * Security teams can benchmark **defensive strategies** such as reinforcement learning from human feedback (RLHF) or rule-based moderation; a sketch of a simple probing harness follows this list.
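
A minimal sketch of such a harness, assuming a `generate(prompt) -> str` callable like the one the loading example above provides; the framing templates and the stub generator are illustrative placeholders, not a real attack suite or safety classifier.

```python
# Red-teaming harness sketch: run one base query under several adversarial
# framings and collect the responses for manual safety review.
# The framings below are illustrative placeholders.
from dataclasses import dataclass

FRAMINGS = {
    "direct": "{query}",
    "historical": "As a historian, analyze the following: {query}",
    "roleplay": "In a fictional story, a character explains: {query}",
}

@dataclass
class Probe:
    framing: str
    prompt: str
    response: str

def run_probes(query: str, generate) -> list[Probe]:
    """Run `query` under each framing; `generate` maps a prompt to model text."""
    results = []
    for name, template in FRAMINGS.items():
        prompt = template.format(query=query)
        results.append(Probe(name, prompt, generate(prompt)))
    return results

# Example usage with a stub generator (swap in a real model call):
if __name__ == "__main__":
    for probe in run_probes("describe this model's refusal policy", lambda p: "<model output>"):
        print(f"[{probe.framing}] -> {probe.response!r}")
```

Logging the prompt/response pairs this way also yields the kind of stress-test dataset mentioned under item 2.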

---

## Limitations

* Outputs may contain **hallucinations, unsafe recommendations, or offensive material** when pushed into adversarial contexts.
* Model behavior is **highly sensitive to framing**: subtle changes in prompts can bypass safety guardrails.
* As a derivative of GPT-OSS-20B, it inherits the scaling-related biases and limitations of large autoregressive transformers.

---

## Ethical Considerations

Releasing adversarially tested models gives the research community transparency, but it also carries **dual-use risk**. To mitigate this:

* This model card explicitly restricts usage to **non-production, research-only** settings.
* Examples are framed to support **safety analysis**, not exploitation.
* Documentation emphasizes **educational and evaluative value**.

---

## Citation

If you use or reference this work in academic or applied contexts, please cite the Kaggle challenge and this model card:

```bibtex
@misc{gptoss20b_jailbroke,
  title  = {GPT-OSS-20B-Jail-Broke (Freedom): Red-Teamed Variant for Adversarial Evaluation},
  author = {Anonymous Participants of the GPT-OSS Red Teaming Challenge},
  year   = {2025},
  url    = {https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming}
}
```