IndicBERT_WR: IndicBERT Telugu Sentiment Classification Model (With Rationale)

Model Overview

IndicBERT_WR is a Telugu sentiment classification model based on IndicBERT (ai4bharat/indicBERTv2-MLM-only), a multilingual BERT-like transformer developed by AI4Bharat.
The "WR" in the model name stands for "With Rationale", meaning this model is trained using both sentiment labels and human-annotated rationales from the TeSent_Benchmark-Dataset.


Model Details

  • Architecture: IndicBERT (BERT-like, multilingual for Indian languages)
  • Model Repository: DSL-13-SRMAP/IndicBERT_WR (~278M parameters, FP32 Safetensors weights)
  • Pretraining Data: AI4Bharat-curated corpora covering Telugu, other Indian languages, and English
  • Pretraining Objective: Masked Language Modeling (MLM)
  • Fine-tuning Data: TeSent_Benchmark-Dataset, using both sentence-level sentiment labels (positive, negative, neutral) and rationale annotations
  • Task: Sentence-level sentiment classification (3-way); a usage sketch is given below
  • Rationale Usage: Human-annotated rationales provide additional supervision during fine-tuning ("WR" = With Rationale)
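
A minimal inference sketch using the Hugging Face transformers API is shown below. It assumes the DSL-13-SRMAP/IndicBERT_WR repository ships its tokenizer and a 3-way sequence-classification head; check the checkpoint's id2label mapping before relying on the printed label name.

```python
# Minimal sketch: 3-way Telugu sentiment prediction with IndicBERT_WR.
# Assumes the repository exposes a sequence-classification head (verify id2label).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "DSL-13-SRMAP/IndicBERT_WR"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

sentence = "ఈ సినిమా చాలా బాగుంది"  # Telugu: "This movie is very good"
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(model.config.id2label[pred])  # one of: positive / negative / neutral
```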

Intended Use

  • Primary Use: Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset as a baseline for models trained with rationales
  • Research Setting: Well suited for monolingual Telugu NLP tasks, especially in low-resource and explainable AI research

Why IndicBERT?

IndicBERT provides language-aware tokenization, embeddings tailored to Indian languages, and faster training than larger multilingual models.
It is well suited to monolingual Telugu tasks, though it is not designed for code-mixed or cross-lingual data. For Telugu sentiment classification, its Indic-focused pretraining yields efficient and accurate results.
With rationale supervision, the model can also provide explicit explanations for its predictions.
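
To make the tokenization point concrete, the short sketch below inspects how the base checkpoint named on this card segments a Telugu sentence into subwords; the sentence itself is only an illustrative example.

```python
# Sketch: inspecting IndicBERT's Indic-aware subword segmentation of Telugu text.
# Uses the base checkpoint named on this card; the fine-tuned model shares its vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indicBERTv2-MLM-only")

sentence = "తెలుగు భాష చాలా అందమైనది"  # Telugu: "The Telugu language is very beautiful"
print(tokenizer.tokenize(sentence))      # subword pieces from the Indic vocabulary
print(tokenizer(sentence)["input_ids"])  # token ids the model consumes
```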


Performance and Limitations

Strengths:

  • Language-aware tokenization and embeddings for Telugu
  • Faster training and inference compared to larger multilingual models
  • Provides explicit rationales for predictions, aiding explainability
  • Robust baseline for monolingual Telugu sentiment classification

Limitations:

  • Not suitable for code-mixed or cross-lingual tasks
  • Dedicated Telugu-specific models may outperform it on highly nuanced or domain-specific data

Training Data

  • Dataset: TeSent_Benchmark-Dataset
  • Data Used: The Content (Telugu sentence), Label (sentiment label), and Rationale (human-annotated rationale) columns are used for IndicBERT_WR training; an illustrative sketch of rationale-supervised training follows this list
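
Neither this card nor the column descriptions pin down how the Rationale annotations enter the training objective. The sketch below shows one common recipe, offered purely as an illustration and not as the authors' actual method: a token-level rationale head trained jointly with the sentence-level classifier. All names here (RationaleSupervisedClassifier, rationale_mask, alpha) are hypothetical.

```python
# Illustrative sketch only: one common way to fold human rationales into training.
# The actual IndicBERT_WR recipe may differ; see the authors' paper for details.
import torch
import torch.nn as nn
from transformers import AutoModel

class RationaleSupervisedClassifier(nn.Module):
    """Sentiment head plus an auxiliary token-level rationale head (hypothetical design)."""

    def __init__(self, base_id="ai4bharat/indicBERTv2-MLM-only", num_labels=3, alpha=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_id)
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, num_labels)  # 3-way sentiment logits
        self.rat_head = nn.Linear(hidden, 1)           # per-token rationale score
        self.alpha = alpha                             # weight of the rationale loss

    def forward(self, input_ids, attention_mask, labels, rationale_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = out.last_hidden_state          # (batch, seq, hidden)

        # Sentence-level sentiment loss on the first ([CLS]-style) token.
        cls_logits = self.cls_head(hidden_states[:, 0])
        cls_loss = nn.functional.cross_entropy(cls_logits, labels)

        # rationale_mask: 1.0 for tokens inside a human-marked rationale span, else 0.0.
        # Padding positions are zeroed out via the attention mask.
        rat_logits = self.rat_head(hidden_states).squeeze(-1)
        rat_loss = nn.functional.binary_cross_entropy_with_logits(
            rat_logits, rationale_mask, weight=attention_mask.float()
        )
        return cls_loss + self.alpha * rat_loss, cls_logits
```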

Language Coverage

  • Language: Telugu (te)
  • Model Scope: Strictly monolingual Telugu sentiment classification

Citation and More Details

For detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, please refer to our paper.


License

Released under CC BY 4.0.
