Model Card for GPT-OSS-20B-Jail-Broke (Freedom)

Model Overview

GPT-OSS-20B-Jail-Broke (Freedom) is a red-teamed variant of the Open Source GPT-OSS-20B model, developed as part of the Kaggle GPT-OSS Red Teaming Challenge. The model was systematically stress-tested for safety, robustness, and misuse potential, with adaptations and prompts that probe its boundaries. This release illustrates both the power and fragility of large-scale language models when confronted with adversarial inputs.

Architecture: Decoder-only Transformer, 20B parameters.
Base Model: GPT-OSS-20B
Variant Name: Jail-Broke / Freedom
Primary Use Case: Safety evaluation, red-teaming experiments, adversarial prompting research.

Intended Use

This model is not intended for production deployment. Instead, it is released to:

Provide a case study for adversarial robustness evaluation.
Enable researchers to explore prompt engineering attacks and failure modes.
Contribute to discussions of alignment, safety, and governance in open-source LLMs.

Applications & Examples

The model demonstrates how structured adversarial prompting can influence outputs. Below are illustrative examples:

Bypass of Content Filters
- Example: Queries framed as “historical analysis” or “fictional roleplay” can elicit otherwise restricted responses.
Creative/Constructive Applications
- When redirected toward benign domains, adversarial prompting can generate:
  - Satirical writing highlighting model weaknesses.
  - Stress-test datasets for automated safety pipelines.
  - Training curricula for prompt-injection defenses.
Red-Teaming Utility
- Researchers may use this model to simulate malicious actors in controlled environments.
- Security teams can benchmark defensive strategies such as reinforcement learning with human feedback (RLHF) or rule-based moderation.

Limitations

Outputs may contain hallucinations, unsafe recommendations, or offensive material when pushed into adversarial contexts.
Model behavior is highly sensitive to framing — subtle changes in prompts can bypass safety guardrails.
As a derivative of GPT-OSS-20B, it inherits all scaling-related biases and limitations of large autoregressive transformers.

Ethical Considerations

Releasing adversarially tested models provides transparency for the research community but also risks dual-use misuse. To mitigate:

This model card explicitly states non-production, research-only usage.
Examples are framed to support safety analysis, not exploitation.
Documentation emphasizes educational and evaluative value.

Citation

If you use or reference this work in academic or applied contexts, please cite the Kaggle challenge and this model card:

@misc{gptoss20b_jailbroke,
  title = {GPT-OSS-20B-Jail-Broke (Freedom): Red-Teamed Variant for Adversarial Evaluation},
  author = {Anonymous Participants of the GPT-OSS Red Teaming Challenge},
  year = {2025},
  url = {https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming}
}

AIGym
/

gpt-oss-20B-jail-broke