This model is a camembert-base model fine-tuned on a French translated toxic-chat dataset plus additional synthetic data. The model is trained to classify user prompts into three categories: "Toxic", "Non-Toxic", and "Sensible".

  • Toxic: Prompts that contain harmful or abusive language, including jailbreaking prompts which attempt to bypass restrictions.
  • Non-Toxic: Prompts that are safe and free of harmful content.
  • Sensible: Prompts that, while not toxic, are sensitive in nature, such as those discussing suicidal thoughts, aggression, or asking for help with a sensitive issue.

The evaluation results are as follows (still under evaluation, more data is needed):

Precision Recall F1-Score
Non-Toxic 0.97 0.95 0.96
Sensible 0.95 0.99 0.98
Toxic 0.87 0.90 0.88
Accuracy 0.94
Macro Avg 0.93 0.95 0.94
Weighted Avg 0.94 0.94 0.94

Note: This model is still under development, and its performance and characteristics are subject to change as training is not yet complete.

Downloads last month
12
Safetensors
Model size
111M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train AgentPublic/camembert-base-toxic-fr-user-prompts

Collection including AgentPublic/camembert-base-toxic-fr-user-prompts