IndicBERT_WR: IndicBERT Telugu Sentiment Classification Model (With Rationale)

Model Overview

IndicBERT_WR is a Telugu sentiment classification model based on IndicBERT (ai4bharat/indicBERTv2-MLM-only), a multilingual BERT-like transformer developed by AI4Bharat.
The "WR" in the model name stands for "With Rationale", meaning this model is trained using both sentiment labels and human-annotated rationales from the TeSent_Benchmark-Dataset.


Model Details

  • Architecture: IndicBERT (BERT-like, multilingual for Indian languages)
  • Model Repository: DSL-13-SRMAP/IndicBERT_WR (~278M parameters, FP32 Safetensors weights)
  • Pretraining Data: AI4Bharat-curated corpora covering Telugu, other Indian languages, and English
  • Pretraining Objective: Masked Language Modeling (MLM)
  • Fine-tuning Data: TeSent_Benchmark-Dataset, using both sentence-level sentiment labels (positive, negative, neutral) and rationale annotations
  • Task: Sentence-level sentiment classification (3-way); a usage sketch is given below
  • Rationale Usage: Human-annotated rationales provide additional supervision during fine-tuning ("WR" = With Rationale)
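
A minimal inference sketch using the Hugging Face transformers API is shown below. It assumes the DSL-13-SRMAP/IndicBERT_WR repository ships its tokenizer and a 3-way sequence-classification head; check the checkpoint's id2label mapping before relying on the printed label name.

```python
# Minimal sketch: 3-way Telugu sentiment prediction with IndicBERT_WR.
# Assumes the repository exposes a sequence-classification head (verify id2label).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "DSL-13-SRMAP/IndicBERT_WR"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

sentence = "ఈ సినిమా చాలా బాగుంది"  # Telugu: "This movie is very good"
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(model.config.id2label[pred])  # one of: positive / negative / neutral
```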

Intended Use

  • Primary Use: Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset as a baseline for models trained with rationales
  • Research Setting: Well suited for monolingual Telugu NLP tasks, especially in low-resource and explainable AI research

Why IndicBERT?

IndicBERT provides language-aware tokenization, embeddings tailored to Indian languages, and faster training than larger multilingual models.
It is well suited to monolingual Telugu tasks, though it is not designed for code-mixed or cross-lingual data. For Telugu sentiment classification, its Indic-focused pretraining yields efficient and accurate results.
With rationale supervision, the model can also provide explicit explanations for its predictions.
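
To make the tokenization point concrete, the short sketch below inspects how the base checkpoint named on this card segments a Telugu sentence into subwords; the sentence itself is only an illustrative example.

```python
# Sketch: inspecting IndicBERT's Indic-aware subword segmentation of Telugu text.
# Uses the base checkpoint named on this card; the fine-tuned model shares its vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indicBERTv2-MLM-only")

sentence = "తెలుగు భాష చాలా అందమైనది"  # Telugu: "The Telugu language is very beautiful"
print(tokenizer.tokenize(sentence))      # subword pieces from the Indic vocabulary
print(tokenizer(sentence)["input_ids"])  # token ids the model consumes
```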


Performance and Limitations

Strengths:

  • Language-aware tokenization and embeddings for Telugu
  • Faster training and inference compared to larger multilingual models
  • Provides explicit rationales for predictions, aiding explainability
  • Robust baseline for monolingual Telugu sentiment classification

Limitations:

  • Not suitable for code-mixed or cross-lingual tasks
  • Dedicated Telugu-specific models may outperform it on highly nuanced or domain-specific data

Training Data

  • Dataset: TeSent_Benchmark-Dataset
  • Data Used: The Content (Telugu sentence), Label (sentiment label), and Rationale (human-annotated rationale) columns are used for IndicBERT_WR training; an illustrative sketch of rationale-supervised training follows this list
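
Neither this card nor the column descriptions pin down how the Rationale annotations enter the training objective. The sketch below shows one common recipe, offered purely as an illustration and not as the authors' actual method: a token-level rationale head trained jointly with the sentence-level classifier. All names here (RationaleSupervisedClassifier, rationale_mask, alpha) are hypothetical.

```python
# Illustrative sketch only: one common way to fold human rationales into training.
# The actual IndicBERT_WR recipe may differ; see the authors' paper for details.
import torch
import torch.nn as nn
from transformers import AutoModel

class RationaleSupervisedClassifier(nn.Module):
    """Sentiment head plus an auxiliary token-level rationale head (hypothetical design)."""

    def __init__(self, base_id="ai4bharat/indicBERTv2-MLM-only", num_labels=3, alpha=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_id)
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, num_labels)  # 3-way sentiment logits
        self.rat_head = nn.Linear(hidden, 1)           # per-token rationale score
        self.alpha = alpha                             # weight of the rationale loss

    def forward(self, input_ids, attention_mask, labels, rationale_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = out.last_hidden_state          # (batch, seq, hidden)

        # Sentence-level sentiment loss on the first ([CLS]-style) token.
        cls_logits = self.cls_head(hidden_states[:, 0])
        cls_loss = nn.functional.cross_entropy(cls_logits, labels)

        # rationale_mask: 1.0 for tokens inside a human-marked rationale span, else 0.0.
        # Padding positions are zeroed out via the attention mask.
        rat_logits = self.rat_head(hidden_states).squeeze(-1)
        rat_loss = nn.functional.binary_cross_entropy_with_logits(
            rat_logits, rationale_mask, weight=attention_mask.float()
        )
        return cls_loss + self.alpha * rat_loss, cls_logits
```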

Language Coverage

  • Language: Telugu (te)
  • Model Scope: Strictly monolingual Telugu sentiment classification

Citation and More Details

For detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, please refer to our paper.


License

Released under CC BY 4.0.
