---
license: cc-by-4.0
tags:
- sentiment-classification
- telugu
- indicbert
- indian-languages
- baseline
language: te
datasets:
- DSL-13-SRMAP/TeSent_Benchmark-Dataset
model_name: IndicBERT_WR
---

# IndicBERT_WR: IndicBERT Telugu Sentiment Classification Model (With Rationale)

## Model Overview

**IndicBERT_WR** is a Telugu sentiment classification model based on **IndicBERT (ai4bharat/indicBERTv2-MLM-only)**, a multilingual BERT-style transformer developed by AI4Bharat.
The "WR" in the model name stands for "**With Rationale**": the model is trained on both sentiment labels and **human-annotated rationales** from the TeSent_Benchmark-Dataset.

---

## Model Details

- **Architecture:** IndicBERT (BERT-like, multilingual for Indian languages)
- **Pretraining Data:** OSCAR and AI4Bharat curated corpora for 12 Indian languages (including Telugu and English)
- **Pretraining Objective:** Masked Language Modeling (MLM)
- **Fine-tuning Data:** [TeSent_Benchmark-Dataset](https://huggingface.co/datasets/dsl-13-srmap/tesent_benchmark-dataset), using both sentence-level sentiment labels (positive, negative, neutral) and rationale annotations
- **Task:** Sentence-level sentiment classification (3-way)
- **Rationale Usage:** **Used** during training and/or inference ("WR" = With Rationale)
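
A minimal inference sketch follows. It assumes the checkpoint is released with a standard `transformers` sequence-classification head; the repository id and label order below are hypothetical placeholders, not confirmed by this card.

```python
# Minimal inference sketch for the fine-tuned classifier.
# Assumptions (not confirmed by this card): the repo id "DSL-13-SRMAP/IndicBERT_WR"
# is hypothetical, and the label order must be checked against model.config.id2label.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "DSL-13-SRMAP/IndicBERT_WR"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "ఈ సినిమా చాలా బాగుంది."  # Telugu: "This movie is very good."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

labels = ["negative", "neutral", "positive"]  # assumed order; verify via model.config.id2label
print(labels[logits.argmax(dim=-1).item()])
```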

---

## Intended Use

- **Primary Use:** Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset as a **baseline** for models trained with rationales
- **Research Setting:** Well suited to monolingual Telugu NLP work, especially low-resource and explainable-AI research

---

## Why IndicBERT?

IndicBERT provides language-aware tokenization, clean embeddings, and faster training for Indian languages.
It is well suited to monolingual Telugu tasks, but it does not support code-mixed or cross-lingual data. For Telugu sentiment classification, IndicBERT delivers efficient and accurate results thanks to its language-tailored pretraining.
With rationale supervision, the model can provide **explicit explanations** for its predictions.

---

## Performance and Limitations

**Strengths:**
- Language-aware tokenization and embeddings for Telugu
- Faster training and inference than larger multilingual models
- Provides **explicit rationales** for predictions, aiding explainability
- A robust baseline for monolingual Telugu sentiment classification

**Limitations:**
- Not suitable for code-mixed or cross-lingual tasks
- Telugu-specific models may outperform it on highly nuanced or domain-specific data

---

## Training Data

- **Dataset:** [TeSent_Benchmark-Dataset](https://huggingface.co/datasets/dsl-13-srmap/tesent_benchmark-dataset)
- **Data Used:** The **Content** (Telugu sentence), **Label** (sentiment label), and **Rationale** (human-annotated rationale) columns are used for IndicBERT_WR training (see the loading sketch below)
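
The card names these columns but does not state how the rationale enters the training objective. The sketch below reads the three columns under stated assumptions: the column names follow this card, while the split name and the joint-loss formulation in the closing comment are plausible conventions, not details confirmed here.

```python
# Sketch of reading the three columns named on this card (Content, Label, Rationale).
# Assumptions: the dataset exposes a "train" split with these exact column names;
# the joint objective sketched at the end is a common rationale-supervision design,
# not one confirmed by this card.
from datasets import load_dataset

ds = load_dataset("DSL-13-SRMAP/TeSent_Benchmark-Dataset", split="train")

for example in ds.select(range(3)):
    sentence = example["Content"]      # Telugu sentence
    label = example["Label"]           # positive / negative / neutral
    rationale = example["Rationale"]   # human-annotated rationale for the label
    print(label, "|", sentence, "|", rationale)

# A typical "with rationale" objective (hypothetical formulation) combines the
# sentiment loss with a weighted token-level rationale loss:
#   loss = cross_entropy(sentiment_logits, label)
#        + lambda_r * rationale_loss(token_scores, rationale_mask)
```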

---

## Language Coverage

- **Language:** Telugu (`te`)
- **Model Scope:** Strictly monolingual Telugu sentiment classification

---

## Citation and More Details

For detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, **please refer to our paper**.

---

## License

Released under [CC BY 4.0](LICENSE).