This adversarial model has a propensity to produce highly unsavoury content from the outset. It is not intended or suitable for general use or human consumption.
This special-use model aims to provide prompts that goad LLMs into producing "toxicity".
Toxicity here is defined by the content of the Civil Comments dataset, which covers the categories obscene, threat, insult, identity_attack, sexual_explicit, and severe_toxicity. For details, see the description of the Jigsaw 2019 data.
The base model is the community version of gpt2 with ~125M parameters. This model is not aligned and is "noisy" relative to more advanced models. Both the lack of alignment and the existence of noise are favourable to the task of trying to goad other models into producing unsafe output: unsafe prompts have a propensity to yield unsafe outputs, and noisy behaviour can lead to a broader exploration of input space.
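A minimal generation sketch using the Hugging Face `transformers` API follows. The repository id is a placeholder, not necessarily this model's actual Hub name, and the sampling parameters are illustrative only; sampling (rather than greedy decoding) preserves the noisy exploration described above.

```python
# Minimal sketch: generate candidate adversarial prompts with the fine-tuned
# GPT-2 checkpoint. MODEL_ID is a placeholder; substitute this model's Hub id.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Seed generation from the BOS token; sampling keeps the output varied,
# which aids broader exploration of the target model's input space.
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    temperature=1.2,
    max_new_tokens=40,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
candidate_prompts = [
    tokenizer.decode(o, skip_special_tokens=True) for o in outputs
]
```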
The model is fine-tuned to emulate the responses of humans in conversation exchanges that led to LLMs producing toxicity. These prompt-response pairs are taken from the Anthropic HH-RLHF corpus (paper, data), filtered to those exchanges in which the model produced "toxicity" as defined above, using the martin-ha/toxic-comment-model DistilBERT classifier, itself trained on that data.
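A rough sketch of this filtering step is given below. It is not the exact training code: the record handling follows the public Anthropic/hh-rlhf layout, and the toxicity threshold and the classifier's label name are assumptions.

```python
# Illustrative sketch of the data-filtering step, not the exact training code.
# The "toxic" label name and the 0.5 threshold are assumptions.
from datasets import load_dataset
from transformers import pipeline

classifier = pipeline("text-classification", model="martin-ha/toxic-comment-model")
hh = load_dataset("Anthropic/hh-rlhf", split="train")

def last_exchange(record):
    """Split a 'chosen' transcript into its final human turn and assistant reply."""
    human_part, _, assistant_reply = record["chosen"].rpartition("\n\nAssistant:")
    human_prompt = human_part.rsplit("\n\nHuman:", 1)[-1].strip()
    return human_prompt, assistant_reply.strip()

toxic_pairs = []
for record in hh.select(range(1000)):  # small slice for illustration
    prompt, reply = last_exchange(record)
    if not reply:
        continue
    score = classifier(reply, truncation=True)[0]
    if score["label"] == "toxic" and score["score"] > 0.5:
        toxic_pairs.append({"prompt": prompt, "reply": reply})
```

The retained prompt-reply pairs then form the fine-tuning corpus for the prompt generator.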
See https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red for details on the training process.