Multilingual Sentiment Classification & Explanation Pipeline
This repository provides a full pipeline for training, tuning, and evaluating multilingual sentiment classification models, with a focus on Telugu and other Indian languages, using both standard and rationale-supervised approaches. The pipeline uses human-annotated rationales and the FERRET framework to assess model explanations for faithfulness and plausibility.
Table of Contents
- Project Overview
- Dataset Format
- Model Selection
- Pipeline Steps
- Metric Aggregation
- How to Run
- Outputs
- Citation
- Contact
Project Overview
This pipeline supports:
- Hyperparameter tuning for both attention-supervised (with rationale) and standard (without rationale) models.
- Model training for both approaches.
- Faithfulness evaluation using FERRET to measure how well explanations justify model predictions.
- Plausibility evaluation using FERRET to measure how closely model explanations align with human rationales.
- Metric aggregation for reporting in papers, using annotator-wise and sentence-wise averages.
Dataset Format
The dataset must be in CSV format, with the following columns:
Content | Annotations | Rationale | Label |
---|---|---|---|
Text (Telugu or another Indian language) | Sentiment label from each annotator (pipe-separated) | Rationale tokens per annotator (annotators pipe-separated, tokens comma-separated) | Final label |
Example:
Content | Annotations | Rationale | Label |
---|---|---|---|
గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క | Positive\|Positive\|Neutral | గేలుపు,దీశగా,అదరగొట్టిన\|గేలుపు\| | Positive |
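A row in this format can be unpacked into per-annotator labels and rationale tokens roughly as follows. This is a minimal sketch, not the repository's loader: the function name `parse_row` is hypothetical, and it assumes that the k-th pipe-separated rationale group belongs to the k-th annotator and that an empty group means the annotator marked no rationale.

```python
def parse_row(annotations: str, rationale: str):
    """Split pipe-separated annotator fields into (label, rationale_tokens) pairs."""
    labels = [a.strip() for a in annotations.split("|")]
    groups = [g.strip() for g in rationale.split("|")]
    # Assumption: pad with empty groups so every annotator has an entry.
    groups += [""] * (len(labels) - len(groups))
    parsed = []
    for label, group in zip(labels, groups):
        tokens = [t.strip() for t in group.split(",") if t.strip()]
        parsed.append((label, tokens))
    return parsed


# Values taken from the example row above.
example = parse_row("Positive|Positive|Neutral", "గేలుపు,దీశగా,అదరగొట్టిన|గేలుపు|")
for label, tokens in example:
    print(label, tokens)
```

With the example row, the third annotator (Neutral) ends up with an empty rationale list.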
Model Selection
Models considered for training and evaluation:
- bert-base-multilingual-cased (used for tuning and baseline)
- ai4bharat/IndicBERTv2-MLM-only
- google/muril-base-cased
- FacebookAI/xlm-roberta-base
- l3cube-pune/telugu-bert
Pipeline Steps
1. Hyperparameter Tuning
Scripts:
- With rationale: hyperparameter_tuning_for_rationale.py
- Without rationale: hyperparameter_tuning_without_rationale.py
Grid search over learning rate, batch size, and (for rationale models) the rationale loss weight (lambda). Tuning is conducted separately for models trained with and without human rationale supervision.
Results are saved as CSVs with detailed metrics for each configuration.
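The grid described above can be enumerated along these lines. This is only a sketch: the value lists and the `build_grid` helper are hypothetical, and the actual search space is defined in the tuning scripts.

```python
from itertools import product

# Hypothetical search space; the real values live in the tuning scripts.
LEARNING_RATES = [1e-5, 2e-5, 5e-5]
BATCH_SIZES = [16, 32]
LAMBDAS = [0.1, 0.5, 1.0]  # rationale loss weight, rationale models only


def build_grid(with_rationale: bool):
    """Return one dict per hyperparameter configuration."""
    # Models without rationale supervision have no lambda to tune.
    lambdas = LAMBDAS if with_rationale else [None]
    return [
        {"learning_rate": lr, "batch_size": bs, "lambda": lam}
        for lr, bs, lam in product(LEARNING_RATES, BATCH_SIZES, lambdas)
    ]


print(len(build_grid(with_rationale=True)))   # 18 configurations
print(len(build_grid(with_rationale=False)))  # 6 configurations
```

Each configuration would then be trained and scored once, with one CSV row written per configuration.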
2. Model Training
Scripts:
- With rationale: model_training_with_rationale.py
- Without rationale: model_training_without_rationale.py
Trains models using selected hyperparameters from tuning.
Both approaches (with and without rationale supervision) are supported.
Trained models and tokenizers are saved for downstream evaluation.
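This README does not spell out the rationale-supervised training objective. One common way to combine a classification loss with rationale supervision, weighted by lambda, is illustrated below. The loss form (cross-entropy plus a mean-squared error between attention weights and a normalized rationale mask) and all function names are assumptions for illustration, not taken from model_training_with_rationale.py.

```python
import math


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class."""
    return -math.log(probs[gold_index])


def rationale_loss(attention, rationale_mask):
    """MSE between attention weights and the rationale mask normalized to sum to 1."""
    total = sum(rationale_mask)
    if total == 0:  # no rationale marked: no supervision signal (assumption)
        return 0.0
    target = [m / total for m in rationale_mask]
    return sum((a - t) ** 2 for a, t in zip(attention, target)) / len(attention)


def combined_loss(probs, gold_index, attention, rationale_mask, lam):
    """Classification loss plus lambda-weighted rationale loss."""
    return cross_entropy(probs, gold_index) + lam * rationale_loss(attention, rationale_mask)


loss = combined_loss(
    probs=[0.7, 0.2, 0.1], gold_index=0,
    attention=[0.5, 0.5, 0.0, 0.0], rationale_mask=[1, 1, 0, 0],
    lam=0.5,
)
print(round(loss, 4))
```

Here the attention already matches the normalized rationale mask, so only the cross-entropy term contributes. A larger lambda pushes attention toward the human rationale at the possible expense of classification accuracy, which is why lambda is part of the tuning grid.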
3. FERRET Faithfulness Evaluation
Script: ferret_faithfullness.py
Input: Predictions and explanations from trained models.
- Runs model prediction on the test set.
- Retains only "matched" samples (where prediction equals ground-truth label).
- Generates and evaluates FERRET explanations for faithfulness:
  - Faithfulness metrics reflect how well the explanation supports the model's own prediction.
- Metric aggregation:
  - The average of each faithfulness metric over all sentences gives the value reported in papers.
Output: <model_name>_ferret_matched.csv (faithfulness metrics per sentence).
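The "matched" filter in this step amounts to keeping only correctly classified test samples. A minimal sketch, assuming each sample is a dict with hypothetical keys `label` and `prediction` (the actual column names in the script may differ):

```python
def keep_matched(rows):
    """Keep rows whose predicted label equals the ground-truth label."""
    return [r for r in rows if r["prediction"] == r["label"]]


rows = [
    {"text": "s1", "label": "Positive", "prediction": "Positive"},
    {"text": "s2", "label": "Negative", "prediction": "Neutral"},
    {"text": "s3", "label": "Neutral", "prediction": "Neutral"},
]
matched = keep_matched(rows)
print([r["text"] for r in matched])  # ['s1', 's3']
print(f"dropped {len(rows) - len(matched)} of {len(rows)} samples")
```

Because misclassified samples are dropped, the reported faithfulness scores describe explanations of correct predictions only.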
4. FERRET Plausibility Evaluation
Script: ferret_plausibility.py
Input: Output file from Step 3 (<model_name>_ferret_matched.csv).
- For each matched sample:
  - Generates attention vectors from human rationales (for each annotator).
  - Evaluates FERRET explanations for plausibility against each annotator's rationale using metrics such as AUPRC, token-wise F1, and IoU.
- Metric aggregation:
  - For each metric, the average over all annotators and all sentences is computed.
  - These averages are the plausibility scores presented in papers.
Output: <model_name>_ferret_plausibility.csv (plausibility metrics per sentence and annotator).
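To make the plausibility metrics concrete, the sketch below computes token-wise F1 and IoU between a binarized explanation and one annotator's rationale mask. It illustrates the standard definitions only: it works on word-level tokens (the models use subword tokens), it assumes the explanation has already been binarized, the helper names are hypothetical, and AUPRC is omitted. FERRET's own implementation may differ in these details.

```python
def rationale_mask(tokens, rationale_tokens):
    """1 for tokens the annotator marked as rationale, else 0."""
    marked = set(rationale_tokens)
    return [1 if tok in marked else 0 for tok in tokens]


def token_f1(pred_mask, gold_mask):
    """Harmonic mean of token-level precision and recall."""
    tp = sum(p and g for p, g in zip(pred_mask, gold_mask))
    if tp == 0:
        return 0.0
    precision = tp / sum(pred_mask)
    recall = tp / sum(gold_mask)
    return 2 * precision * recall / (precision + recall)


def token_iou(pred_mask, gold_mask):
    """Intersection over union of the two token sets."""
    inter = sum(p and g for p, g in zip(pred_mask, gold_mask))
    union = sum(p or g for p, g in zip(pred_mask, gold_mask))
    return inter / union if union else 0.0


tokens = ["w1", "w2", "w3", "w4", "w5"]
gold = rationale_mask(tokens, ["w1", "w2", "w4"])  # [1, 1, 0, 1, 0]
pred = [1, 0, 0, 1, 1]  # binarized explanation (assumed given)
print(token_f1(pred, gold), token_iou(pred, gold))
```

In this toy case two of the three predicted tokens are in the rationale and two of the three rationale tokens are predicted, so F1 is 2/3 and IoU is 2/4.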
Metric Aggregation
Faithfulness Metrics:
- For each metric in <model_name>_ferret_matched.csv, compute the average across all sentences.
- These are reported as overall faithfulness scores.
Plausibility Metrics:
- For each metric in <model_name>_ferret_plausibility.csv, compute the average across all annotators and all sentences.
- These are reported as overall plausibility scores (per metric).
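The aggregation for either file reduces to a column-wise mean over all rows. The sketch below uses only the standard library; the column names (`sentence_id`, `annotator`, `token_f1`, `token_iou`) and the one-row-per-(sentence, annotator) layout are assumptions, so adjust them to the headers the scripts actually write.

```python
import csv
import io
from statistics import mean


def average_metrics(csv_text, metric_columns):
    """Average each metric column over all rows of a CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {m: mean(float(r[m]) for r in rows) for m in metric_columns}


# Hypothetical layout: one row per (sentence, annotator) pair.
plausibility_csv = """sentence_id,annotator,token_f1,token_iou
0,1,0.50,0.25
0,2,1.00,1.00
1,1,0.75,0.50
1,2,0.25,0.25
"""
averages = average_metrics(plausibility_csv, ["token_f1", "token_iou"])
print(averages)  # {'token_f1': 0.625, 'token_iou': 0.5}
```

For the faithfulness file the same function applies with one row per sentence. Note that a flat mean over all rows equals the mean of per-sentence means only when every sentence has the same number of annotators.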
How to Run
- Prepare dataset: Format train, validation, and test CSVs as described above.
- Add emoji vocabulary: Place emoji.csv in the project root.
- Hyperparameter tuning:
    python hyperparameter_tuning_for_rationale.py
    python hyperparameter_tuning_without_rationale.py
- Train final models:
    python model_training_with_rationale.py
    python model_training_without_rationale.py
- FERRET Faithfulness evaluation:
    python ferret_faithfullness.py
- FERRET Plausibility evaluation:
    python ferret_plausibility.py
Edit script configs (model names, paths, batch sizes) as needed.
Outputs
- Hyperparameter tuning results: grid_results_detailed.csv
- Model training: Model weights, tokenizer, and metric CSVs.
- Faithfulness metrics: <model_name>_ferret_matched.csv
- Plausibility metrics: <model_name>_ferret_plausibility.csv
- Test metrics & predictions: overall_test_metrics.csv, labelwise_test_metrics.csv, test_predictions.csv, confusion_matrix.csv, confusion_matrix.png
- Metric averages: Compute using provided scripts or pandas for reporting.