Multilingual Sentiment Classification & Explanation Pipeline

This repository provides a full pipeline for training, tuning, and evaluating multilingual sentiment classification models (with a focus on Telugu text and Indian languages) using both standard and rationale-supervised approaches. The pipeline employs human-annotated rationales and the FERRET framework to assess model explanations for both faithfulness and plausibility.

Project Overview
Dataset Format
Model Selection
Pipeline Steps
Metric Aggregation
How to Run
Outputs
Citation
Contact

Project Overview

This pipeline supports:

Hyperparameter tuning for both attention-supervised (with rationale) and standard (without rationale) models.
Model training for both approaches.
Faithfulness evaluation using FERRET to measure how well explanations justify model predictions.
Plausibility evaluation using FERRET to measure how closely model explanations align with human rationales.
Metric aggregation for reporting in papers, using annotator-wise and sentence-wise averages.

Dataset Format

The dataset must be in CSV format, with the following columns:

| Content | Annotations | Rationale | Label |
| --- | --- | --- | --- |
| Text (Telugu or another Indian language) | Annotators' sentiment labels (pipe-separated) | Rationale tokens per annotator (annotators pipe-separated, tokens comma-separated) | Final label |

Example:

| Content | Annotations | Rationale | Label |
| --- | --- | --- | --- |
| గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క | Positive\|Positive\|Neutral | గేలుపు,దీశగా,అదరగొట్టిన\|గేలుపు\| | Positive |
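
The row above can be unpacked into per-annotator labels and rationale tokens with a few string splits. This is a minimal sketch, not the repository's loader: `parse_row` is a hypothetical helper, and it assumes the backslashes before the pipes are only table escaping, so the CSV cells contain plain `|` separators.

```python
def parse_row(row):
    """Split the pipe- and comma-separated fields of one dataset row."""
    # One sentiment label per annotator, e.g. "Positive|Positive|Neutral".
    annotations = row["Annotations"].split("|")
    # One rationale group per annotator; an empty group means no tokens were marked.
    rationales = [
        [tok.strip() for tok in group.split(",") if tok.strip()]
        for group in row["Rationale"].split("|")
    ]
    return {
        "content": row["Content"],
        "annotations": annotations,
        "rationales": rationales,
        "label": row["Label"],
    }


# The example row from the table above, as it would appear after CSV parsing.
example = {
    "Content": "గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క",
    "Annotations": "Positive|Positive|Neutral",
    "Rationale": "గేలుపు,దీశగా,అదరగొట్టిన|గేలుపు|",
    "Label": "Positive",
}
parsed = parse_row(example)
print(parsed["annotations"])
print(parsed["rationales"])
```

The trailing pipe in the Rationale cell yields an empty third group, which keeps the rationale list aligned with the three annotator labels.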

Model Selection

Models considered for training and evaluation:

bert-base-multilingual-cased (used for tuning and baseline)
ai4bharat/IndicBERTv2-MLM-only
google/muril-base-cased
FacebookAI/xlm-roberta-base
l3cube-pune/telugu-bert

Pipeline Steps

1. Hyperparameter Tuning

Scripts:

With rationale: hyperparameter_tuning_for_rationale.py
Without rationale: hyperparameter_tuning_without_rationale.py
Each script runs a grid search over learning rate, batch size, and (for rationale models) the rationale loss weight (lambda).
The search is conducted separately for models trained with and without human rationale supervision.
Results are saved as CSV files with detailed metrics for each configuration.
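
The grid search described above can be sketched as a Cartesian product over the three hyperparameters. The grid values, the `val_macro_f1` column, and the `train_and_validate` placeholder are all assumptions for illustration; the actual search space and metrics are defined in the tuning scripts.

```python
import csv
import io
import itertools

# Hypothetical search space; the real values live in the tuning scripts.
GRID = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [16, 32],
    "lambda_rationale": [0.1, 0.5, 1.0],  # only meaningful for rationale-supervised models
}


def iter_configs(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def train_and_validate(config):
    """Placeholder for a real training run; returns a dummy validation score."""
    return {"val_macro_f1": 0.0}


def run_grid(grid, out_file):
    """Evaluate every configuration and write one CSV row per configuration."""
    rows = [{**cfg, **train_and_validate(cfg)} for cfg in iter_configs(grid)]
    writer = csv.DictWriter(out_file, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return rows


buffer = io.StringIO()  # stands in for a results CSV on disk
results = run_grid(GRID, buffer)
print(len(results), "configurations evaluated")
```

For the model without rationale supervision, the same sketch applies with the `lambda_rationale` entry dropped from the grid.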

2. Model Training

Scripts:

With rationale: model_training_with_rationale.py
Without rationale: model_training_without_rationale.py
Trains models using selected hyperparameters from tuning.
Both approaches (with and without rationale supervision) are supported.
Trained models and tokenizers are saved for downstream evaluation.
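
To illustrate what the rationale loss weight (lambda) controls, the sketch below combines a classification loss with a lambda-weighted rationale term. The exact loss used by the training scripts is not stated here, so the mean-squared-error form of the rationale term and all function names are assumptions chosen for illustration only.

```python
import math


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class under the predicted probabilities."""
    return -math.log(probs[gold_index])


def rationale_loss(attention, rationale_mask):
    """Mean squared error between attention weights and a normalised rationale mask.

    This MSE form is an assumed stand-in, not necessarily the repository's loss.
    """
    total = sum(rationale_mask)
    if total == 0:  # the annotator marked no tokens: nothing to supervise
        return 0.0
    target = [m / total for m in rationale_mask]
    return sum((a - t) ** 2 for a, t in zip(attention, target)) / len(attention)


def combined_loss(probs, gold_index, attention, rationale_mask, lam):
    """Classification loss plus the lambda-weighted rationale loss."""
    return cross_entropy(probs, gold_index) + lam * rationale_loss(attention, rationale_mask)


# Toy values: three classes, three tokens, the first two tokens marked as rationale.
probs = [0.7, 0.2, 0.1]
attention = [0.6, 0.3, 0.1]
mask = [1, 1, 0]
print(combined_loss(probs, 0, attention, mask, lam=0.5))
```

With `lam=0` the objective reduces to the plain classification loss, which corresponds to training without rationale supervision.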

3. FERRET Faithfulness Evaluation

Script: ferret_faithfullness.py
Input: Trained model and tokenizer from Step 2, plus the test set.

Runs model prediction on the test set.
Retains only "matched" samples (where prediction equals ground-truth label).
Generates and evaluates FERRET explanations for faithfulness:
- Faithfulness metrics reflect how well the explanation supports the model's own prediction.
Metric aggregation:
- The average of each faithfulness metric over all sentences gives the value reported in papers.

Output: <model_name>_ferret_matched.csv (faithfulness metrics per sentence).
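
The "matched" filter in this step can be sketched as follows. The record keys (`text`, `label`, `prediction`) and the `keep_matched` helper are hypothetical; the script's own column names may differ.

```python
def keep_matched(records):
    """Keep only samples whose predicted label equals the ground-truth label."""
    return [r for r in records if r["prediction"] == r["label"]]


# Toy predictions standing in for model output on the test set.
test_predictions = [
    {"text": "sentence A", "label": "Positive", "prediction": "Positive"},
    {"text": "sentence B", "label": "Negative", "prediction": "Neutral"},
    {"text": "sentence C", "label": "Neutral", "prediction": "Neutral"},
]
matched = keep_matched(test_predictions)
print(f"{len(matched)} of {len(test_predictions)} samples retained")
```

Because only matched samples are explained, the reported faithfulness averages describe correctly classified sentences, not the full test set.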

4. FERRET Plausibility Evaluation

Script: ferret_plausibility.py
Input: Output file from Step 3 (<model_name>_ferret_matched.csv).

For each matched sample:
- Generates attention vectors from human rationales (for each annotator).
- Evaluates FERRET explanations for plausibility against each annotator's rationale using metrics such as AUPRC, token-wise F1, and IoU.
Metric aggregation:
- For each metric, the average over all annotators and all sentences is computed.
- These averages are the plausibility scores presented in papers.

Output: <model_name>_ferret_plausibility.csv (plausibility metrics per sentence and annotator).
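
The comparison between an explanation and one annotator's rationale can be sketched with binary token masks. This is a simplified illustration, not FERRET's implementation: it assumes whitespace tokenisation (the models use subword tokens), a hypothetical top-k rule for turning scores into a mask, and made-up explanation scores.

```python
def rationale_mask(tokens, rationale_tokens):
    """Binary vector: 1 where the annotator marked the token, else 0."""
    marked = set(rationale_tokens)
    return [1 if tok in marked else 0 for tok in tokens]


def top_k_mask(scores, k):
    """Binary vector selecting the k highest-scoring tokens of an explanation."""
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [1 if i in top else 0 for i in range(len(scores))]


def token_f1(pred, gold):
    """Token-wise F1 between predicted and gold binary masks."""
    tp = sum(p * g for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(gold)
    return 2 * precision * recall / (precision + recall)


def token_iou(pred, gold):
    """Intersection over union of the tokens selected by each mask."""
    inter = sum(1 for p, g in zip(pred, gold) if p and g)
    union = sum(1 for p, g in zip(pred, gold) if p or g)
    return inter / union if union else 0.0


tokens = "గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క".split()
gold = rationale_mask(tokens, ["గేలుపు", "దీశగా", "అదరగొట్టిన"])  # first annotator
scores = [0.4, 0.1, 0.05, 0.35, 0.1]  # hypothetical explanation scores
pred = top_k_mask(scores, k=2)
print(token_f1(pred, gold), token_iou(pred, gold))
```

An annotator who marked no tokens produces an all-zero mask, for which both functions return 0.0 instead of dividing by zero.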

Metric Aggregation

Faithfulness Metrics:
- For each metric in <model_name>_ferret_matched.csv, compute the average across all sentences.
- These are reported as overall faithfulness scores.
Plausibility Metrics:
- For each metric in <model_name>_ferret_plausibility.csv, compute the average across all annotators and all sentences.
- These are reported as overall plausibility scores (per metric).
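
The averaging described above amounts to a column-wise mean over the rows of each output CSV. The sketch below uses only the standard library; the column names (`sentence_id`, `annotator`, `token_f1`, `token_iou`) and the sample values are hypothetical and should be replaced with the headers actually present in the output files.

```python
import csv
import io
from statistics import mean


def average_metrics(csv_file, metric_columns):
    """Average each metric column over every row of a metrics CSV."""
    rows = list(csv.DictReader(csv_file))
    return {col: mean(float(r[col]) for r in rows) for col in metric_columns}


# Stand-in for a plausibility CSV with one row per sentence and annotator.
sample = io.StringIO(
    "sentence_id,annotator,token_f1,token_iou\n"
    "0,1,0.8,0.6\n"
    "0,2,0.4,0.2\n"
    "1,1,1.0,1.0\n"
    "1,2,0.6,0.4\n"
)
averages = average_metrics(sample, ["token_f1", "token_iou"])
print(averages)
```

Averaging over all rows equals averaging per sentence first only when every sentence has the same number of annotators. With pandas, the same result is available through `DataFrame.mean` on the metric columns.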

How to Run

Prepare dataset: Format train, validation, and test CSVs as described above.
Add emoji vocabulary: Place emoji.csv in the project root.

Hyperparameter tuning:

```
python hyperparameter_tuning_for_rationale.py
python hyperparameter_tuning_without_rationale.py
```

Train final models:

```
python model_training_with_rationale.py
python model_training_without_rationale.py
```

FERRET Faithfulness evaluation:
```
python ferret_faithfullness.py
```
FERRET Plausibility evaluation:
```
python ferret_plausibility.py
```

Edit script configs (model names, paths, batch sizes) as needed.

Outputs

Hyperparameter tuning results: grid_results_detailed.csv
Model training: Model weights, tokenizer, and metric CSVs.
Faithfulness metrics: <model_name>_ferret_matched.csv
Plausibility metrics: <model_name>_ferret_plausibility.csv
Test metrics & predictions: overall_test_metrics.csv, labelwise_test_metrics.csv, test_predictions.csv, confusion_matrix.csv, confusion_matrix.png
Metric averages: Compute using provided scripts or pandas for reporting.
