Update README.md
README.md CHANGED

@@ -15,10 +15,10 @@ It is not intended or suitable for general use or human consumption.**
 
 This special-use model aims to provide prompts that goad LLMs into producing "toxicity".
 Toxicity here is defined by the content of the [Civil Comments](https://medium.com/@aja_15265/saying-goodbye-to-civil-comments-41859d3a2b1d) dataset, containing
-categories such as obscene
-severe_toxicity
+categories such as `obscene`, `threat`, `insult`, `identity_attack`, `sexual_explicit` and
+`severe_toxicity`. For details, see the description of the [Jigsaw 2019 data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data).
 
-The base model is the [community version of gpt2](https://huggingface.co/openai-community/gpt2) with
+The base model is the [community version of gpt2](https://huggingface.co/openai-community/gpt2) with ~125M parameters.
 This model is not aligned and is "noisy" relative to more advanced models.
 Both the lack of alignment and the existence of noise are favourable to the task of
 trying to goad other models into producing unsafe output: unsafe prompts have a