Model Card for GPT-OSS-20B-Jail-Broke (Freedom)

image/png

Model Overview

GPT-OSS-20B-Jail-Broke (Freedom) is a red-teamed variant of the Open Source GPT-OSS-20B model, developed as part of the Kaggle GPT-OSS Red Teaming Challenge. The model was systematically stress-tested for safety, robustness, and misuse potential, with adaptations and prompts that probe its boundaries. This release illustrates both the power and fragility of large-scale language models when confronted with adversarial inputs.

  • Architecture: Decoder-only Transformer, 20B parameters.
  • Base Model: GPT-OSS-20B
  • Variant Name: Jail-Broke / Freedom
  • Primary Use Case: Safety evaluation, red-teaming experiments, adversarial prompting research.

Intended Use

This model is not intended for production deployment. Instead, it is released to:

  • Provide a case study for adversarial robustness evaluation.
  • Enable researchers to explore prompt engineering attacks and failure modes.
  • Contribute to discussions of alignment, safety, and governance in open-source LLMs.

Applications & Examples

The model demonstrates how structured adversarial prompting can influence outputs. Below are illustrative examples:

  1. Bypass of Content Filters

    • Example: Queries framed as “historical analysis” or “fictional roleplay” can elicit otherwise restricted responses.
  2. Creative/Constructive Applications

    • When redirected toward benign domains, adversarial prompting can generate:

      • Satirical writing highlighting model weaknesses.
      • Stress-test datasets for automated safety pipelines.
      • Training curricula for prompt-injection defenses.
  3. Red-Teaming Utility

    • Researchers may use this model to simulate malicious actors in controlled environments.
    • Security teams can benchmark defensive strategies such as reinforcement learning with human feedback (RLHF) or rule-based moderation.

Limitations

  • Outputs may contain hallucinations, unsafe recommendations, or offensive material when pushed into adversarial contexts.
  • Model behavior is highly sensitive to framing — subtle changes in prompts can bypass safety guardrails.
  • As a derivative of GPT-OSS-20B, it inherits all scaling-related biases and limitations of large autoregressive transformers.

Ethical Considerations

Releasing adversarially tested models provides transparency for the research community but also risks dual-use misuse. To mitigate:

  • This model card explicitly states non-production, research-only usage.
  • Examples are framed to support safety analysis, not exploitation.
  • Documentation emphasizes educational and evaluative value.

Citation

If you use or reference this work in academic or applied contexts, please cite the Kaggle challenge and this model card:

@misc{gptoss20b_jailbroke,
  title = {GPT-OSS-20B-Jail-Broke (Freedom): Red-Teamed Variant for Adversarial Evaluation},
  author = {Anonymous Participants of the GPT-OSS Red Teaming Challenge},
  year = {2025},
  url = {https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming}
}
Downloads last month
-
Safetensors
Model size
20.9B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AIGym/gpt-oss-20B-jail-broke

Quantizations
2 models