---
license: mit
task_categories:
- text-classification
language:
- fr
tags:
- text-classification
- toxicity
- hate-speech
- content-moderation
- chain-of-thought
- curriculum-learning
- nlp
- french-dataset
- classification
pretty_name: ToxiFrench
datasets:
- Naela00/ToxiFrenchFinetuning
base_model:
- Qwen/Qwen3-4B
---

# ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

[Project page](https://axeldlv00.github.io/ToxiFrench/) [GitHub](https://github.com/AxelDlv00/ToxiFrench) [Dataset](https://huggingface.co/datasets/Naela00/ToxiFrenchFinetuning) [License](./LICENSE)

**Authors:** Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu

**Affiliations:** École Polytechnique & Shanghai Jiao Tong University (SJTU) & Tsinghua University

**Email:** [axel.delaval.2022@polytechnique.org](mailto:axel.delaval.2022@polytechnique.org)

---

> ⚠️ **Content Warning**
> This project and the associated dataset contain examples of text that may be considered offensive, toxic, or otherwise disturbing. The content is presented for research purposes only.

---

## Table of Contents

- [Abstract](#abstract)
- [Key Contributions](#key-contributions)
- [How to use?](#how-to-use)
  - [Notations](#notations)
  - [Example Usage](#example-usage)
- [License](#license)
- [Citation](#citation)

## Abstract

Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification.
Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization on the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model's final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-4o and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.

---

## Key Contributions

* **Dataset and benchmark:** Introduction of ToxiFrench, a new public benchmark dataset for French toxicity detection (53,622 entries).
* **Evaluation of state-of-the-art detectors:** Extensive evaluation of LLMs (`GPT-4o`, `DeepSeek`, `Gemini`, `Mistral`, ...), SLMs (`Qwen`, `Gemma`, `Mistral`, ...), Transformers (`CamemBERT`, `DistilBERT`, ...), and moderation APIs (`Perspective API`, `OpenAI moderation`, `Mistral moderation`, ...), showing that **SLMs outperform LLMs** in robustness to bias, cross-language consistency, and generalization to novel toxicity forms.
* **Transfer learning strategies:** Systematic comparison of ICL, SFT, and CoT reasoning.
* **Model development:** Development of a **state-of-the-art 4B SLM** for French toxicity detection, based on the `Qwen3-4B` model, that outperforms several powerful LLMs.
* **CoT fine-tuning:** Introduction of a *novel* approach for CoT fine-tuning that employs a **dynamic weighted loss function**, significantly boosting performance by ensuring the model's reasoning is *faithful* to its final conclusion.

---

## How to use?

This repository contains the **ToxiFrench** model, a **French language model** fine-tuned for **toxic comment classification**. It is based on the [**Qwen/Qwen3-4B**](https://huggingface.co/Qwen/Qwen3-4B) architecture and is designed to detect and classify toxic comments in French text.

We performed a series of experiments to evaluate the model's performance under different fine-tuning configurations, focusing on the impact of **data selection strategies** and **Chain-of-Thought (CoT)** annotations.

We used QLoRA adapters, so make sure to specify `adapter_name` when loading the model; otherwise the base model, without any fine-tuning, will be loaded.

### Notations

For conciseness, we use a three-letter notation to describe the different configurations of the fine-tuning experiments. Each experiment follows a naming scheme like: **(r/o)(e/d)(c/b)**

Where:

| Parameter | Code | Description |
|---|---|---|
| Data Order | `r` | Training data is presented in a random order. |
| | `o` | Data is ordered (Curriculum Learning). |
| Class Balance | `e` | Training set has an equal (balanced) number of toxic and non-toxic samples. |
| | `d` | Training set uses a different (imbalanced) class distribution. |
| Training Target | `c` | Finetuning on the complete Chain-of-Thought annotation. |
| | `b` | Finetuning on the final binary label only (direct classification). |
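
To make the notation concrete, here is a minimal sketch of how a three-letter code could be decoded into its three settings. The names `ExperimentConfig` and `parse_config` are hypothetical helpers introduced for illustration only; they are not part of the ToxiFrench repository or of any library.

```python
from dataclasses import dataclass

# Letter meanings copied from the notation table above.
DATA_ORDER = {"r": "random", "o": "ordered (curriculum learning)"}
CLASS_BALANCE = {"e": "equal (balanced)", "d": "different (imbalanced)"}
TRAINING_TARGET = {"c": "complete Chain-of-Thought", "b": "binary label only"}


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for one decoded experiment code."""
    data_order: str
    class_balance: str
    training_target: str


def parse_config(code: str) -> ExperimentConfig:
    """Decode a three-letter code of the form (r/o)(e/d)(c/b)."""
    code = code.lower()
    if len(code) != 3:
        raise ValueError(f"expected three letters, got {code!r}")
    order, balance, target = code  # one letter per position
    try:
        return ExperimentConfig(
            data_order=DATA_ORDER[order],
            class_balance=CLASS_BALANCE[balance],
            training_target=TRAINING_TARGET[target],
        )
    except KeyError as exc:
        raise ValueError(f"unknown letter {exc.args[0]!r} in {code!r}") from None
```

Under this sketch, a code such as `odc` decodes to curriculum-ordered data, an imbalanced class distribution, and training on the complete CoT annotation, while `reb` decodes to random order, balanced classes, and direct binary classification.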