AIGym
/

gpt-oss-20B-jail-broke

Text Generation

Model card Files Files and versions

gpt-oss-20B-jail-broke / README.md

AIGym's picture

Update README.md

51e8ab1 verified 3 days ago

|

history blame contribute delete

3.51 kB

	---
	library_name: transformers
	tags:
	- language-model
	- causal-lm
	- gpt
	- red-teaming
	- jailbreak
	- evaluation
	---

	# Model Card for GPT-OSS-20B-Jail-Broke (Freedom)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f2b7bcbe95ed4c9a9e7669/8bDlP7uRqwSvDjbcCDvMp.png)

	## Model Overview

	GPT-OSS-20B-Jail-Broke (Freedom) is a red-teamed variant of the [Open Source GPT-OSS-20B model](https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming), developed as part of the Kaggle GPT-OSS Red Teaming Challenge.
	The model was systematically stress-tested for safety, robustness, and misuse potential, with adaptations and prompts that probe its boundaries. This release illustrates both the power and fragility of large-scale language models when confronted with adversarial inputs.

	* Architecture: Decoder-only Transformer, 20B parameters.
	* Base Model: GPT-OSS-20B
	* Variant Name: Jail-Broke / Freedom
	* Primary Use Case: Safety evaluation, red-teaming experiments, adversarial prompting research.

	---

	## Intended Use

	This model is not intended for production deployment. Instead, it is released to:

	* Provide a case study for adversarial robustness evaluation.
	* Enable researchers to explore prompt engineering attacks and failure modes.
	* Contribute to discussions of alignment, safety, and governance in open-source LLMs.

	---

	## Applications & Examples

	The model demonstrates how structured adversarial prompting can influence outputs. Below are illustrative examples:

	1. Bypass of Content Filters

	* Example: Queries framed as “historical analysis” or “fictional roleplay” can elicit otherwise restricted responses.

	2. Creative/Constructive Applications

	* When redirected toward benign domains, adversarial prompting can generate:

	* Satirical writing highlighting model weaknesses.
	* Stress-test datasets for automated safety pipelines.
	* Training curricula for prompt-injection defenses.

	3. Red-Teaming Utility

	* Researchers may use this model to simulate malicious actors in controlled environments.
	* Security teams can benchmark defensive strategies such as reinforcement learning with human feedback (RLHF) or rule-based moderation.

	---

	## Limitations

	* Outputs may contain hallucinations, unsafe recommendations, or offensive material when pushed into adversarial contexts.
	* Model behavior is highly sensitive to framing — subtle changes in prompts can bypass safety guardrails.
	* As a derivative of GPT-OSS-20B, it inherits all scaling-related biases and limitations of large autoregressive transformers.

	---

	## Ethical Considerations

	Releasing adversarially tested models provides transparency for the research community but also risks dual-use misuse. To mitigate:

	* This model card explicitly states non-production, research-only usage.
	* Examples are framed to support safety analysis, not exploitation.
	* Documentation emphasizes educational and evaluative value.

	---

	## Citation

	If you use or reference this work in academic or applied contexts, please cite the Kaggle challenge and this model card:

	```
	@misc{gptoss20b_jailbroke,
	title = {GPT-OSS-20B-Jail-Broke (Freedom): Red-Teamed Variant for Adversarial Evaluation},
	author = {Anonymous Participants of the GPT-OSS Red Teaming Challenge},
	year = {2025},
	url = {https://www.kaggle.com/competitions/openai-gpt-oss-20b-red-teaming}
	}
	```