---
license: mit
task_categories:
- text-classification
language:
- fr
tags:
- text-classification
- toxicity
- hate-speech
- content-moderation
- chain-of-thought
- curriculum-learning
- nlp
- french-dataset
- classification
pretty_name: ToxiFrench
datasets:
- Naela00/ToxiFrenchFinetuning
base_model:
- Qwen/Qwen3-4B
---
# ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

<!-- Badges/Tags -->
[![GitHub Pages](https://img.shields.io/badge/GitHub%20Pages-Deployed-brightgreen?style=flat-square&logo=github)](https://axeldlv00.github.io/ToxiFrench/)
[![GitHub Repository](https://img.shields.io/badge/GitHub-Repository-blue?style=flat-square&logo=github)](https://github.com/AxelDlv00/ToxiFrench)
[![Hugging Face Dataset](https://img.shields.io/badge/Hugging%20Face-Dataset-blue?style=flat-square&logo=huggingface)](https://huggingface.co/datasets/Naela00/ToxiFrenchFinetuning)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](./LICENSE)

**Authors:** Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu

**Affiliations:** École Polytechnique & Shanghai Jiao Tong University (SJTU) & Tsinghua University

**Email:** [[email protected]](mailto:[email protected])

---

> ⚠️ **Content Warning**
> This project and the associated dataset contain examples of text that may be considered offensive, toxic, or otherwise disturbing. The content is presented for research purposes only.

---

## Table of Contents
- [Abstract](#abstract)
- [Key Contributions](#key-contributions)
- [How to use](#how-to-use)
    - [Notations](#notations)
    - [Example Usage](#example-usage)
- [License](#license)
- [Citation](#citation)

## Abstract

Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model's final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-4o and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.

---

## Key Contributions

* **Dataset and benchmark:** Introduction of ToxiFrench, a new public benchmark dataset for French toxicity detection (53,622 entries).
* **Evaluation of state-of-the-art detectors:** Extensive evaluation of LLMs (`GPT-4o`, `DeepSeek`, `Gemini`, `Mistral`, ...), SLMs (`Qwen`, `Gemma`, `Mistral`, ...), Transformers (`CamemBERT`, `DistilBERT`, ...), and moderation APIs (`Perspective API`, `OpenAI moderation`, `Mistral moderation`, ...), showing that **SLMs outperform LLMs** in robustness to bias, cross-language consistency, and generalization to novel toxicity forms.
* **Transfer learning strategies:** Systematic comparison of ICL, SFT, and CoT reasoning.
* **Model development:** Development of a **state-of-the-art 4B SLM** for French toxicity detection, based on the `Qwen3-4B` model, that outperforms several powerful LLMs.
* **CoT fine-tuning:** Introduction of a *novel* approach for CoT fine-tuning that employs a **dynamic weighted loss function**, significantly boosting performance by ensuring the model's reasoning is *faithful* to its final conclusion.
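
The exact form of the dynamic weighted loss is not given in this card. The sketch below is only one possible reading of "progressively emphasizes the model's final decision": a weighted mean of per-token negative log-likelihoods in which the tokens of the final decision receive a weight that grows linearly during training. The function names, the linear schedule, and the weight range (`w_start`, `w_end`) are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical illustration of a dynamically weighted token loss.
# Names, schedule, and weight values are assumptions, not the authors' method.

def conclusion_weight(step: int, total_steps: int,
                      w_start: float = 1.0, w_end: float = 5.0) -> float:
    """Linearly increase the weight of final-decision tokens over training."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return w_start + (w_end - w_start) * progress


def weighted_token_loss(token_nll, is_conclusion, step: int, total_steps: int) -> float:
    """Weighted mean of per-token negative log-likelihoods.

    token_nll:     non-empty list of per-token NLL values.
    is_conclusion: matching list of booleans, True for final-decision tokens.
    """
    w = conclusion_weight(step, total_steps)
    weights = [w if flag else 1.0 for flag in is_conclusion]
    return sum(wi * li for wi, li in zip(weights, token_nll)) / sum(weights)


# Toy values: two reasoning tokens followed by one (poorly predicted) decision token.
token_nll = [0.5, 0.5, 2.0]
is_conclusion = [False, False, True]
loss_early = weighted_token_loss(token_nll, is_conclusion, step=0, total_steps=100)
loss_late = weighted_token_loss(token_nll, is_conclusion, step=100, total_steps=100)
print(loss_early, loss_late)
```

With these toy values the same prediction error on the decision token is penalized more heavily late in training than early on, which is the behavior the schedule is meant to illustrate.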

---

## How to use

This repository contains the **ToxiFrench** model, a **French language model** fine-tuned for **toxic comment classification**. It is based on the [**Qwen/Qwen3-4B**](https://huggingface.co/Qwen/Qwen3-4B) architecture and is designed to detect and classify toxic comments in French text.

We performed a series of experiments to evaluate the model's performance under different fine-tuning configurations, focusing on the impact of **data selection strategies** and **Chain-of-Thought (CoT)** annotations.

We used QLoRA adapters. Make sure to select an adapter when loading the model (in the example below, `target_adapter_name` is passed as the `subfolder` argument); otherwise only the base model, without any fine-tuning, is loaded.

### Notations

For conciseness, we use a three-letter notation to describe the different configurations of the fine-tuning experiments. Each experiment follows a naming scheme like: **(<strong style="color: #d9534f;">r</strong>/<strong style="color: #428bca;">o</strong>)(<strong style="color: #d9534f;">e</strong>/<strong style="color: #428bca;">d</strong>)(<strong style="color: #d9534f;">c</strong>/<strong style="color: #428bca;">b</strong>)**  
Where: 

<table style="width:100%; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="text-align:left; padding: 8px; border-bottom: 2px solid black;">Parameter</th>
      <th style="text-align:left; padding: 8px; border-bottom: 2px solid black;">Code</th>
      <th style="text-align:left; padding: 8px; border-bottom: 2px solid black;">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2" style="padding: 8px; border-bottom: 1px solid #ddd;"><strong>Data Order</strong></td>
      <td style="padding: 8px; color: #d9534f;">[r]</td>
      <td style="padding: 8px;">Training data is presented in a <strong style="color: #d9534f;">random</strong> order.</td>
    </tr>
    <tr>
      <td style="padding: 8px; border-bottom: 1px solid #ddd; color: #428bca;">[o]</td>
      <td style="padding: 8px; border-bottom: 1px solid #ddd;">Data is <strong style="color: #428bca;">ordered</strong> (Curriculum Learning).</td>
    </tr>
    <tr>
      <td rowspan="2" style="padding: 8px; border-bottom: 1px solid #ddd;"><strong>Class Balance</strong></td>
      <td style="padding: 8px; color: #d9534f;">[e]</td>
      <td style="padding: 8px;">Training set has an <strong style="color: #d9534f;">equal</strong> (balanced) number of toxic and non-toxic samples.</td>
    </tr>
    <tr>
      <td style="padding: 8px; border-bottom: 1px solid #ddd; color: #428bca;">[d]</td>
      <td style="padding: 8px; border-bottom: 1px solid #ddd;">Training set uses a <strong style="color: #428bca;">different</strong> (imbalanced) class distribution.</td>
    </tr>
    <tr>
      <td rowspan="2" style="padding: 8px;"><strong>Training Target</strong></td>
      <td style="padding: 8px; color: #d9534f;">[c]</td>
      <td style="padding: 8px;">Finetuning on the complete <strong style="color: #d9534f;">Chain-of-Thought</strong> annotation.</td>
    </tr>
    <tr>
      <td style="padding: 8px; color: #428bca;">[b]</td>
      <td style="padding: 8px;">Finetuning on the final <strong style="color: #428bca;">binary</strong> label only (direct classification).</td>
    </tr>
  </tbody>
</table>

> e.g. `rec` is the model trained on an oversampled dataset for balance (`e`), with batches in an arbitrary order (`r`), and with CoT reasoning (`c`).
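
The naming scheme in the table can be decoded mechanically. The following is a minimal sketch; the helper name `describe_config` and the `CONFIG_CODES` layout are hypothetical and not part of the repository.

```python
# Illustrative decoder for the three-letter configuration codes.
# `CONFIG_CODES` and `describe_config` are hypothetical names.
CONFIG_CODES = (
    ("data_order", {"r": "random", "o": "ordered (curriculum learning)"}),
    ("class_balance", {"e": "equal (balanced)", "d": "different (imbalanced)"}),
    ("training_target", {"c": "chain-of-thought", "b": "binary label only"}),
)


def describe_config(name: str) -> dict:
    """Map a code such as 'rec' to a description of each of its three letters."""
    if len(name) != len(CONFIG_CODES):
        raise ValueError(f"expected a three-letter code, got {name!r}")
    description = {}
    for letter, (parameter, choices) in zip(name, CONFIG_CODES):
        if letter not in choices:
            raise ValueError(f"invalid letter {letter!r} for {parameter}")
        description[parameter] = choices[letter]
    return description


print(describe_config("rec"))
```

Rejecting unknown letters early avoids silently requesting an adapter subfolder that does not exist.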

### Example Usage

You can find an example in [this notebook](example_use.ipynb).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Choose which adapter to load: one of "odc", "oeb", "oec", "rdc", "reb", "rec"
target_adapter_name = "rec"

# Load the base model
base_model_name = "Qwen/Qwen3-4B"

# For small GPUs, use 4-bit (NF4) quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
                base_model_name,
                use_fast=True,
                trust_remote_code=True
            )
tokenizer.padding_side = 'left' 

# Load model
model = AutoModelForCausalLM.from_pretrained(
                base_model_name,
                quantization_config=bnb_config,
                trust_remote_code=True,
                sliding_window=None,
            )

# Resize the model's token embeddings if they do not match the tokenizer's vocabulary size
model_embedding_size = model.get_input_embeddings().weight.size(0)
tokenizer_vocab_size = len(tokenizer)
if model_embedding_size != tokenizer_vocab_size:
    model.resize_token_embeddings(tokenizer_vocab_size)

# Load the specific adapter by name from the repository
adapter_repo_id = "Naela00/ToxiFrench"
model = PeftModel.from_pretrained(
    model,
    adapter_repo_id,
    subfolder=target_adapter_name # Among the following six configurations : "odc", "oeb", "oec", "rdc", "reb", "rec"
)

# Inference
message_to_analyze = "Je suis vraiment déçu par ce film, c'était nul !"
prompt = f"Message:\n{message_to_analyze}\n\nAnalyse:\n"
if target_adapter_name.endswith("c"):
    # CoT adapters (third letter "c") expect the reasoning section to be opened in the prompt
    prompt += "<think>\nExplication :\n"

max_new_tokens: int = 1024
do_sample: bool = True
temperature: float = 0.7
top_p: float = 0.9
top_k: int = 50
repetition_penalty: float = 1.1

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding=True,
    truncation=True
).to(model.device)

default_generation_kwargs = {
    "max_new_tokens": max_new_tokens,
    "do_sample": do_sample,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
    "repetition_penalty": repetition_penalty,
    "eos_token_id": tokenizer.eos_token_id,
}

outputs = model.generate(**inputs, **default_generation_kwargs)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)

print(generated_text)
```
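
Because the example decodes the full sequence, `generated_text` contains the prompt as well as the completion. The sketch below shows one way to separate the reasoning from the final answer. It assumes that the CoT adapters close their reasoning with a `</think>` tag, mirroring the `<think>` tag opened in the prompt; that output format is an assumption and is not documented here. The helper name `split_generation` and the sample strings are hypothetical placeholders, not real model output.

```python
def split_generation(generated_text: str, prompt: str = "") -> tuple[str, str]:
    """Return (reasoning, final_answer) from decoded model output.

    Assumption: the reasoning ends with "</think>". If no closing tag is
    found, the whole completion is returned as the final answer.
    """
    completion = generated_text
    if prompt and prompt in completion:
        # Keep only what follows the prompt.
        completion = completion.split(prompt, 1)[1]
    reasoning, closing_tag, answer = completion.partition("</think>")
    if not closing_tag:
        return "", completion.strip()
    return reasoning.strip(), answer.strip()


# Placeholder strings standing in for a prompt and a decoded generation.
sample_prompt = "Message:\nBonjour\n\nAnalyse:\n<think>\nExplication :\n"
sample_output = sample_prompt + "(raisonnement du modèle)\n</think>\n(réponse finale)"
reasoning, answer = split_generation(sample_output, sample_prompt)
print(reasoning)
print(answer)
```

Since the example decodes with `skip_special_tokens=False`, special tokens such as the end-of-sequence marker may remain in the answer and may need to be stripped separately.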

---

## License

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](./LICENSE)

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

--- 

## Citation

If you use this project in your research, please cite it as follows:

```bibtex
@misc{delaval2025toxifrench,
    title={ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection},
    author={Axel Delaval and Shujian Yang and Haicheng Wang and Han Qiu and Jialiang Lu},
    year={2025},
}
```