---
license: cc-by-4.0
tags:
  - sentiment-classification
  - telugu
  - indicbert
  - indian-languages
  - baseline
language: te
datasets:
  - DSL-13-SRMAP/TeSent_Benchmark-Dataset
model_name: IndicBERT_WOR
---

# IndicBERT_WOR: IndicBERT Telugu Sentiment Classification Model (Without Rationale)

## Model Overview

IndicBERT_WOR is a Telugu sentiment classification model based on IndicBERT (ai4bharat/indicBERTv2-MLM-only), a multilingual BERT-like transformer developed by AI4Bharat.
The "WOR" in the model name stands for "Without Rationale", meaning this model is trained only with sentiment labels from the TeSent_Benchmark-Dataset and does not use human-annotated rationales.


## Model Details

- Architecture: IndicBERT (BERT-like, multilingual for Indian languages)
- Pretraining Data: OSCAR and AI4Bharat curated corpora for 12 Indian languages (including Telugu and English)
- Pretraining Objective: Masked Language Modeling (MLM)
- Fine-tuning Data: TeSent_Benchmark-Dataset, using only sentence-level sentiment labels (positive, negative, neutral); rationale annotations are disregarded
- Task: Sentence-level sentiment classification (3-way); a head-configuration sketch follows this list
- Rationale Usage: Not used during training or inference ("WOR" = Without Rationale)
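
As noted in the Task entry, the classifier is a standard 3-way sequence-classification head on top of the IndicBERT backbone. A minimal sketch of the head configuration with `transformers` is shown below; the label-to-id mapping is an illustrative assumption, not the exact training configuration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Backbone named in this card; the label mapping below is an assumption.
backbone = "ai4bharat/indicBERTv2-MLM-only"
id2label = {0: "negative", 1: "neutral", 2: "positive"}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(
    backbone,
    num_labels=3,  # positive / negative / neutral
    id2label=id2label,
    label2id=label2id,
)
```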

## Intended Use

- Primary Use: Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset as a baseline for models trained without rationales (an evaluation sketch follows this list)
- Research Setting: Well suited to monolingual Telugu NLP research, especially low-resource and explainable-AI studies
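
A minimal sketch of how such a baseline evaluation could look is shown below; it assumes a `test` split with `Content`/`Label` columns and that the checkpoint's `id2label` names match the dataset's label strings (all of these are assumptions, not specifics from this card).

```python
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

# Placeholder checkpoint path; split and column names are assumptions.
clf = pipeline("text-classification", model="path/to/IndicBERT_WOR", truncation=True)
test = load_dataset("DSL-13-SRMAP/TeSent_Benchmark-Dataset", split="test")

preds = [out["label"] for out in clf(test["Content"], batch_size=32)]
print("accuracy:", accuracy_score(test["Label"], preds))
print("macro-F1:", f1_score(test["Label"], preds, average="macro"))
```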

## Why IndicBERT?

IndicBERT provides language-aware tokenization, embeddings tailored to Indian languages, and faster training than larger multilingual models.
It is well suited to monolingual Telugu tasks, but it is not designed for code-mixed data or cross-lingual transfer. For Telugu sentiment classification, its Indic-focused pretraining makes it an efficient and accurate backbone.
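
To illustrate the language-aware tokenization, the short sketch below runs the IndicBERT tokenizer on a Telugu sentence (assuming the `transformers` library and the backbone id named above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indicBERTv2-MLM-only")

# Tokenize a short Telugu sentence and inspect the subword pieces.
text = "ఈ పుస్తకం చాలా బాగుంది"  # "This book is very good"
print(tokenizer.tokenize(text))
print(tokenizer(text)["input_ids"])
```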


## Performance and Limitations

Strengths:

- Language-aware tokenization and embeddings for Telugu
- Faster training and inference compared to larger multilingual models
- Robust baseline for monolingual Telugu sentiment classification

Limitations:

- Not suitable for code-mixed or cross-lingual tasks
- Telugu-specific models may outperform it on highly nuanced or domain-specific data
- Since rationales are not used, the model cannot provide explicit explanations for its predictions

## Training Data

- Dataset: TeSent_Benchmark-Dataset
- Data Used: Only the Content (Telugu sentence) and Label (sentiment label) columns; rationale annotations are ignored for IndicBERT_WOR training (a data-preparation sketch follows this list)
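
As referenced above, the sketch below shows one way the Content/Label columns could be prepared and the backbone fine-tuned with `transformers`. The split names, the assumption that Label values are the strings "positive"/"negative"/"neutral", and all hyperparameters are illustrative, not the exact setup used for this model.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Dataset id from the card metadata; split names and label encoding are assumptions.
raw = load_dataset("DSL-13-SRMAP/TeSent_Benchmark-Dataset")

backbone = "ai4bharat/indicBERTv2-MLM-only"
tokenizer = AutoTokenizer.from_pretrained(backbone)
label2id = {"negative": 0, "neutral": 1, "positive": 2}

def preprocess(batch):
    # Keep only Content and Label; rationale columns are dropped below.
    enc = tokenizer(batch["Content"], truncation=True, max_length=128)
    enc["labels"] = [label2id[label] for label in batch["Label"]]
    return enc

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)

model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=3)

args = TrainingArguments(
    output_dir="indicbert_wor",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# Passing the tokenizer enables dynamic padding via DataCollatorWithPadding.
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], tokenizer=tokenizer)
trainer.train()
```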

## Language Coverage

- Language: Telugu (te)
- Model Scope: Strictly monolingual Telugu sentiment classification

## Citation and More Details

For detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, please refer to our paper.


## License

Released under the CC BY 4.0 license.