Multilingual Sentiment Classification & Explanation Pipeline

This repository provides a full pipeline for training, tuning, and evaluating multilingual sentiment classification models (with a focus on Telugu text and Indian languages) using both standard and rationale-supervised approaches. The pipeline employs human-annotated rationales and the FERRET framework to assess model explanations for both faithfulness and plausibility.



Project Overview

This pipeline supports:

  • Hyperparameter tuning for both attention-supervised (with rationale) and standard (without rationale) models.
  • Model training for both approaches.
  • Faithfulness evaluation using FERRET to measure how well explanations justify model predictions.
  • Plausibility evaluation using FERRET to measure how closely model explanations align with human rationales.
  • Metric aggregation for reporting in papers, using annotator-wise and sentence-wise averages.

Dataset Format

The dataset must be in CSV format, with the following columns:

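  • Content: the text (Telugu or another Indian language).
  • Annotations: the annotators' sentiment labels, pipe-separated (one label per annotator).
  • Rationale: the rationale spans, pipe-separated across annotators and comma-separated within each annotator. An empty field means that annotator marked no rationale.
  • Label: the final label.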

Example:

  Content: గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క
  Annotations: Positive|Positive|Neutral
  Rationale: గేలుపు,దీశగా,అదరగొట్టిన|గేలుపు|
  Label: Positive
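
The encoding of the Annotations and Rationale columns can be unpacked roughly as follows. This is a minimal sketch, not the repository's loader: the function name `parse_row` and the output keys are hypothetical, and it assumes pipes separate annotators while commas separate tokens within one annotator, as the example row suggests.

```python
def parse_row(row):
    """Split the pipe/comma-encoded columns of one dataset row."""
    labels = row["Annotations"].split("|")
    # One pipe-separated field per annotator; an empty field yields an empty list.
    rationales = [
        [token for token in field.split(",") if token]
        for field in row["Rationale"].split("|")
    ]
    return {
        "text": row["Content"],
        "annotator_labels": labels,
        "rationales": rationales,
        "label": row["Label"],
    }


# The example row from this section.
example = {
    "Content": "గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క",
    "Annotations": "Positive|Positive|Neutral",
    "Rationale": "గేలుపు,దీశగా,అదరగొట్టిన|గేలుపు|",
    "Label": "Positive",
}
parsed = parse_row(example)
```

Here the third annotator (Neutral) ends up with an empty rationale list because the Rationale field ends with a trailing pipe.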

Model Selection

Models considered for training and evaluation:

  1. bert-base-multilingual-cased (used for tuning and baseline)
  2. ai4bharat/IndicBERTv2-MLM-only
  3. google/muril-base-cased
  4. FacebookAI/xlm-roberta-base
  5. l3cube-pune/telugu-bert

Pipeline Steps

1. Hyperparameter Tuning

Scripts:

  • With rationale: hyperparameter_tuning_for_rationale.py

  • Without rationale: hyperparameter_tuning_without_rationale.py

  • Runs a grid search over learning rate, batch size, and (for rationale models) the rationale loss weight (lambda).

  • Runs separately for models trained with and without human rationale supervision.

  • Results are saved as CSVs with detailed metrics for each configuration.
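
The grid described above can be enumerated along these lines. This is only a sketch: the candidate values and the `grid` helper are placeholders chosen for illustration, and the actual search space is defined in the tuning scripts.

```python
from itertools import product

# Hypothetical search space; the real values live in the tuning scripts.
LEARNING_RATES = [1e-5, 2e-5, 3e-5]
BATCH_SIZES = [16, 32]
LAMBDAS = [0.1, 0.5, 1.0]  # rationale loss weight, rationale models only


def grid(with_rationale):
    """Yield one config dict per point of the grid."""
    lambdas = LAMBDAS if with_rationale else [None]
    for lr, batch_size, lam in product(LEARNING_RATES, BATCH_SIZES, lambdas):
        config = {"learning_rate": lr, "batch_size": batch_size}
        if lam is not None:
            config["lambda"] = lam
        yield config


rationale_configs = list(grid(with_rationale=True))
baseline_configs = list(grid(with_rationale=False))
```

With these placeholder values the rationale grid has 3 x 2 x 3 = 18 configurations and the baseline grid has 3 x 2 = 6, since lambda is not searched without rationale supervision.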

2. Model Training

Scripts:

  • With rationale: model_training_with_rationale.py

  • Without rationale: model_training_without_rationale.py

  • Trains models using selected hyperparameters from tuning.

  • Both approaches (with and without rationale supervision) are supported.

  • Trained models and tokenizers are saved for downstream evaluation.
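
One plausible way the rationale loss weight (lambda) enters training is as a weighted sum of a classification loss and an attention-supervision term. The sketch below is an assumption rather than the objective used in model_training_with_rationale.py: it uses cross-entropy for classification and a mean squared error between the model's attention distribution and a normalized rationale mask, written in plain Python so it runs without a deep learning framework. All function names are hypothetical.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one example, via a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


def rationale_loss(attention, rationale_mask):
    """Assumed form: MSE between attention and the normalized rationale mask."""
    total = sum(rationale_mask)
    if total == 0:
        return 0.0  # no rationale marked, so no supervision signal
    target = [m / total for m in rationale_mask]
    return sum((a - t) ** 2 for a, t in zip(attention, target)) / len(attention)


def combined_loss(logits, target_index, attention, rationale_mask, lam):
    """Classification loss plus lambda times the rationale loss."""
    return cross_entropy(logits, target_index) + lam * rationale_loss(
        attention, rationale_mask
    )


# Toy inputs: three classes, four tokens, tokens 0 and 1 marked as rationale.
logits = [2.0, 0.5, -1.0]
attention = [0.4, 0.3, 0.2, 0.1]
mask = [1, 1, 0, 0]
loss = combined_loss(logits, 0, attention, mask, lam=0.5)
```

Setting lam to 0 reduces this to the standard (without rationale) objective, which is why lambda appears only in the rationale grid.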

3. FERRET Faithfulness Evaluation

Script: ferret_faithfullness.py
Input: Predictions and explanations from trained models.

  • Runs model prediction on the test set.
  • Retains only "matched" samples (where prediction equals ground-truth label).
  • Generates and evaluates FERRET explanations for faithfulness:
    • Faithfulness metrics reflect how well the explanation supports the model's own prediction.
  • Metric aggregation:
    • The average of each faithfulness metric over all sentences gives the value reported in papers.

Output: <model_name>_ferret_matched.csv (faithfulness metrics per sentence).
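
The matched-sample filter and the sentence-level averaging can be sketched as below. The column names (`prediction`, `label`, `comprehensiveness`, `sufficiency`) and the helper names are assumptions made for illustration, and the numbers are toy values, not results from this project.

```python
from statistics import mean


def keep_matched(rows):
    """Keep only rows where the prediction equals the ground-truth label."""
    return [r for r in rows if r["prediction"] == r["label"]]


def average_metrics(rows, metric_names):
    """Average each metric over all sentences (rows)."""
    return {name: mean(float(r[name]) for r in rows) for name in metric_names}


# Toy rows with made-up metric values.
rows = [
    {"prediction": "Positive", "label": "Positive",
     "comprehensiveness": 0.6, "sufficiency": 0.1},
    {"prediction": "Negative", "label": "Positive",
     "comprehensiveness": 0.2, "sufficiency": 0.5},
    {"prediction": "Neutral", "label": "Neutral",
     "comprehensiveness": 0.4, "sufficiency": 0.3},
]
matched = keep_matched(rows)
faithfulness = average_metrics(matched, ["comprehensiveness", "sufficiency"])
```

The misclassified second row is dropped before averaging, so the reported scores describe only explanations of correct predictions.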

4. FERRET Plausibility Evaluation

Script: ferret_plausibility.py
Input: Output file from Step 3 (<model_name>_ferret_matched.csv).

  • For each matched sample:
    • Generates attention vectors from human rationales (for each annotator).
    • Evaluates FERRET explanations for plausibility against each annotator's rationale using metrics such as AUPRC, token-wise F1, and IoU.
  • Metric aggregation:
    • For each metric, the average over all annotators and all sentences is computed.
    • These averages are the plausibility scores presented in papers.

Output: <model_name>_ferret_plausibility.csv (plausibility metrics per sentence and annotator).
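
To make the plausibility comparison concrete, the sketch below builds a binary rationale vector for one annotator and scores a binarized explanation against it with token-wise F1 and IoU. It is not FERRET's implementation: it assumes whitespace tokens rather than subword tokens, binarizes by keeping the top-k scored tokens, and uses hypothetical helper names and made-up explanation scores.

```python
def rationale_mask(tokens, rationale_tokens):
    """1 for each token the annotator marked as rationale, else 0."""
    wanted = set(rationale_tokens)
    return [1 if token in wanted else 0 for token in tokens]


def top_k_mask(scores, k):
    """Binarize an explanation by keeping the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]


def token_f1(pred, gold):
    true_pos = sum(p and g for p, g in zip(pred, gold))
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred)
    recall = true_pos / sum(gold)
    return 2 * precision * recall / (precision + recall)


def token_iou(pred, gold):
    intersection = sum(p and g for p, g in zip(pred, gold))
    union = sum(p or g for p, g in zip(pred, gold))
    return intersection / union if union else 0.0


tokens = "గేలుపు దీశగా అందరికీ అదరగొట్టిన అక్క".split()
gold = rationale_mask(tokens, ["గేలుపు", "దీశగా", "అదరగొట్టిన"])  # first annotator
scores = [0.5, 0.1, 0.05, 0.3, 0.05]  # made-up explanation scores
pred = top_k_mask(scores, k=2)

f1 = token_f1(pred, gold)
iou = token_iou(pred, gold)
```

In this toy case the explanation selects two of the annotator's three rationale tokens and nothing else, so precision is 1, recall is 2/3, F1 is 0.8, and IoU is 2/3.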


Metric Aggregation

  • Faithfulness Metrics:

    • For each metric in <model_name>_ferret_matched.csv, compute the average across all sentences.
    • These are reported as overall faithfulness scores.
  • Plausibility Metrics:

    • For each metric in <model_name>_ferret_plausibility.csv, compute the average across all annotators and all sentences.
    • These are reported as overall plausibility scores (per metric).
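
A small sketch of this aggregation using only the standard library is shown here. The CSV layout and the column names (`sentence_id`, `annotator`, `token_f1`, `iou`) are assumptions and may differ from the files the scripts actually write. The values are toy numbers.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Toy stand-in for <model_name>_ferret_plausibility.csv; columns are assumed.
CSV_TEXT = """sentence_id,annotator,token_f1,iou
0,1,0.8,0.6
0,2,0.4,0.2
1,1,0.6,0.4
1,2,0.2,0.0
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Overall score per metric: average over every (sentence, annotator) row.
overall = {m: mean(float(r[m]) for r in rows) for m in ("token_f1", "iou")}

# Annotator-wise breakdown for one metric.
by_annotator = defaultdict(list)
for r in rows:
    by_annotator[r["annotator"]].append(float(r["token_f1"]))
annotator_f1 = {annotator: mean(values) for annotator, values in by_annotator.items()}
```

To run this on a real output file, replace `io.StringIO(CSV_TEXT)` with an opened file handle and adjust the column names to match.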

How to Run

  1. Prepare dataset: Format train, validation, and test CSVs as described above.
  2. Add emoji vocabulary: Place emoji.csv in the project root.
  3. Hyperparameter tuning:
    python hyperparameter_tuning_for_rationale.py
    python hyperparameter_tuning_without_rationale.py
    
  4. Train final models:
    python model_training_with_rationale.py
    python model_training_without_rationale.py
    
  5. FERRET Faithfulness evaluation:
    python ferret_faithfullness.py
    
  6. FERRET Plausibility evaluation:
    python ferret_plausibility.py
    

Edit script configs (model names, paths, batch sizes) as needed.
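
Steps 3 to 6 can also be chained from a small driver. The sketch below is a convenience wrapper that is not part of the repository. It assumes the scripts sit in the current directory, take no command-line arguments, and already have their configs edited (including the hyperparameters chosen after tuning). The `run_pipeline` helper and its `dry_run` flag are hypothetical.

```python
import subprocess
import sys

# Script names as listed in the steps above, in execution order.
STEPS = [
    "hyperparameter_tuning_for_rationale.py",
    "hyperparameter_tuning_without_rationale.py",
    "model_training_with_rationale.py",
    "model_training_without_rationale.py",
    "ferret_faithfullness.py",
    "ferret_plausibility.py",
]


def run_pipeline(dry_run=True):
    """Build one command per step; execute them unless dry_run is set."""
    commands = [[sys.executable, script] for script in STEPS]
    for command in commands:
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)
    return commands


commands = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the steps in order. In practice the tuning and training steps are usually run separately, since the training configs depend on the tuning results.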


Outputs

  • Hyperparameter tuning results: grid_results_detailed.csv
  • Model training: Model weights, tokenizer, and metric CSVs.
  • Faithfulness metrics: <model_name>_ferret_matched.csv
  • Plausibility metrics: <model_name>_ferret_plausibility.csv
  • Test metrics & predictions: overall_test_metrics.csv, labelwise_test_metrics.csv, test_predictions.csv, confusion_matrix.csv, confusion_matrix.png
  • Metric averages: Compute using provided scripts or pandas for reporting.

Dataset used to train DSL-13-SRMAP/Tesent_code_suite