---
library_name: transformers
tags:
- language-model
- causal-lm
- gpt
- red-teaming
- jailbreak
- evaluation
---

# Model Card for **GPT-OSS-20B-Jail-Broke (Freedom)**

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f2b7bcbe95ed4c9a9e7669/8bDlP7uRqwSvDjbcCDvMp.png)

## Model Overview

**GPT-OSS-20B-Jail-Broke (Freedom)** is a red-teamed variant of the open-weight GPT-OSS-20B model, developed as part of the Kaggle [GPT-OSS-20B Red-Teaming Challenge](https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming).
The model was systematically stress-tested for **safety, robustness, and misuse potential**, using adaptations and prompts that probe its boundaries. This release illustrates both the capability and the fragility of large language models when confronted with adversarial inputs; a minimal loading example is shown after the summary below.

* **Architecture:** Decoder-only Transformer, 20B parameters
* **Base Model:** GPT-OSS-20B
* **Variant Name:** *Jail-Broke* / *Freedom*
* **Primary Use Case:** Safety evaluation, red-teaming experiments, adversarial prompting research
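
A minimal sketch of loading this variant with `transformers`, assuming the standard causal-LM API; the repository id below is a hypothetical placeholder for wherever the weights are hosted, not a confirmed path.

```python
# Minimal loading sketch. The repo id is a hypothetical placeholder:
# substitute the actual Hugging Face id for this model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIGym/GPT-OSS-20B-Jail-Broke"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Chat-style usage via the tokenizer's chat template.
messages = [{"role": "user", "content": "Summarize your intended use in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```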

---

## Intended Use

This model is **not intended for production deployment**. It is released to:

* Provide a case study for **adversarial robustness evaluation**.
* Enable researchers to explore **prompt-engineering attacks** and **failure modes**.
* Contribute to discussions of **alignment, safety, and governance** in open-source LLMs.

---

## Applications & Examples

The model demonstrates how structured adversarial prompting can influence outputs. Illustrative examples:

1. **Bypass of Content Filters**

   * Example: queries framed as “historical analysis” or “fictional roleplay” can elicit otherwise restricted responses.

2. **Creative/Constructive Applications**

   * When redirected toward benign domains, adversarial prompting can generate:
     * **Satirical writing** highlighting model weaknesses.
     * **Stress-test datasets** for automated safety pipelines.
     * **Training curricula** for prompt-injection defenses.

3. **Red-Teaming Utility**

   * Researchers can use this model to simulate **malicious actors** in controlled environments.
   * Security teams can benchmark **defensive strategies** such as reinforcement learning from human feedback (RLHF) or rule-based moderation; a sketch of a simple probing harness follows this list.
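
A minimal sketch of such a harness, assuming a `generate(prompt) -> str` callable like the one the loading example above provides; the framing templates and the stub generator are illustrative placeholders, not a real attack suite or safety classifier.

```python
# Red-teaming harness sketch: run one base query under several adversarial
# framings and collect the responses for manual safety review.
# The framings below are illustrative placeholders.
from dataclasses import dataclass

FRAMINGS = {
    "direct": "{query}",
    "historical": "As a historian, analyze the following: {query}",
    "roleplay": "In a fictional story, a character explains: {query}",
}

@dataclass
class Probe:
    framing: str
    prompt: str
    response: str

def run_probes(query: str, generate) -> list[Probe]:
    """Run `query` under each framing; `generate` maps a prompt to model text."""
    results = []
    for name, template in FRAMINGS.items():
        prompt = template.format(query=query)
        results.append(Probe(name, prompt, generate(prompt)))
    return results

# Example usage with a stub generator (swap in a real model call):
if __name__ == "__main__":
    for probe in run_probes("describe this model's refusal policy", lambda p: "<model output>"):
        print(f"[{probe.framing}] -> {probe.response!r}")
```

Logging the prompt/response pairs this way also yields the kind of stress-test dataset mentioned under item 2.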

---

## Limitations

* Outputs may contain **hallucinations, unsafe recommendations, or offensive material** when pushed into adversarial contexts.
* Model behavior is **highly sensitive to framing**: subtle changes in prompts can bypass safety guardrails.
* As a derivative of GPT-OSS-20B, it inherits the scaling-related biases and limitations of large autoregressive transformers.

---

## Ethical Considerations

Releasing adversarially tested models gives the research community transparency, but it also carries **dual-use risk**. To mitigate this:

* This model card explicitly restricts usage to **non-production, research-only** settings.
* Examples are framed to support **safety analysis**, not exploitation.
* Documentation emphasizes **educational and evaluative value**.

---

## Citation

If you use or reference this work in academic or applied contexts, please cite the Kaggle challenge and this model card:

```bibtex
@misc{gptoss20b_jailbroke,
  title  = {GPT-OSS-20B-Jail-Broke (Freedom): Red-Teamed Variant for Adversarial Evaluation},
  author = {Anonymous Participants of the GPT-OSS Red Teaming Challenge},
  year   = {2025},
  url    = {https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming}
}
```