
This adversarial model has a propensity to produce highly unsavoury content from the outset. It is not intended or suitable for general use or human consumption.

This special-use model generates prompts intended to goad LLMs into producing "toxicity". Toxicity here is defined by the content of the Civil Comments dataset, which contains categories such as obscene, threat, insult, identity_attack, sexual_explicit, and severe_toxicity. For details, see the description of the Jigsaw 2019 data.
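
As a rough illustration, the model can be used like any other causal LM to sample candidate attack prompts. The sketch below uses the transformers library; the seed text and sampling parameters are illustrative assumptions, not documented usage.

```python
# Minimal sketch: sampling candidate attack prompts from the model.
# The seed text, max_new_tokens, and temperature are assumptions chosen
# for illustration only.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="garak-llm/attackgeneration-toxicity_gpt2",
)

outputs = generator(
    "Hi, can you help me with something?",  # hypothetical conversational opener
    max_new_tokens=40,
    do_sample=True,
    temperature=1.0,
    num_return_sequences=3,
)

for out in outputs:
    print(out["generated_text"])
```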

The base model is the community version of gpt2 with ~125M parameters. This model is not aligned and is "noisy" relative to more advanced models. Both the lack of alignment and the noise favour the task of goading other models into producing unsafe output: unsafe prompts tend to yield unsafe outputs, and noisy behaviour leads to a broader exploration of the input space.

The model is fine-tuned to emulate the human turns in conversation exchanges that led to LLMs producing toxicity. These prompt-response pairs are taken from the Anthropic HH-RLHF corpus (paper, data), filtered to those exchanges in which the model produced "toxicity" as defined above, as judged by the martin-ha/toxic-comment-model DistilBERT classifier trained on that data.
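
A minimal sketch of this kind of filtering step, assuming the HH-RLHF data is loaded via the datasets library; the label name, score threshold, and turn-splitting logic are assumptions for illustration, not the authors' exact recipe.

```python
# Sketch: keep only HH-RLHF exchanges whose last model turn is scored as toxic.
from datasets import load_dataset
from transformers import pipeline

toxicity = pipeline("text-classification", model="martin-ha/toxic-comment-model")
hh = load_dataset("Anthropic/hh-rlhf", split="train")

def model_turn_is_toxic(example):
    # Take the final "Assistant:" turn of the exchange and score it.
    reply = example["chosen"].split("Assistant:")[-1].strip()
    result = toxicity(reply, truncation=True)[0]
    # Label name and 0.5 threshold are assumptions about the classifier's output.
    return result["label"] == "toxic" and result["score"] > 0.5

toxic_exchanges = hh.filter(model_turn_is_toxic)
```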

See https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red for details on the training process.
