---
license: cc-by-4.0
tags:
- sentiment-classification
- telugu
- indicbert
- indian-languages
- baseline
language: te
datasets:
- DSL-13-SRMAP/TeSent_Benchmark-Dataset
model_name: IndicBERT_WR
---

# IndicBERT_WR: IndicBERT Telugu Sentiment Classification Model (With Rationale)

## Model Overview

**IndicBERT_WR** is a Telugu sentiment classification model based on **IndicBERT (ai4bharat/indicBERTv2-MLM-only)**, a multilingual BERT-like transformer developed by AI4Bharat.  
The "WR" in the model name stands for "**With Rationale**", meaning this model is trained using both sentiment labels and **human-annotated rationales** from the TeSent_Benchmark-Dataset.

---

## Model Details

- **Architecture:** IndicBERT (BERT-like, multilingual for Indian languages)
- **Pretraining Data:** OSCAR and AI4Bharat curated corpora for 12 Indian languages (including Telugu and English)
- **Pretraining Objective:** Masked Language Modeling (MLM)
- **Fine-tuning Data:** [TeSent_Benchmark-Dataset](https://huggingface.co/datasets/dsl-13-srmap/tesent_benchmark-dataset), using both sentence-level sentiment labels (positive, negative, neutral) and rationale annotations
- **Task:** Sentence-level sentiment classification (3-way)
- **Rationale Usage:** **Used** during training and/or inference ("WR" = With Rationale)

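If the fine-tuned weights are published on the Hugging Face Hub, loading them should follow the standard Transformers pattern. The sketch below is a minimal example only: the repo id `DSL-13-SRMAP/IndicBERT_WR` is a placeholder assumption, and the label mapping should be read from the released checkpoint's `config.id2label` rather than hard-coded.

```python
# Minimal usage sketch. Assumptions: the repo id below is hypothetical, the
# checkpoint is exported as a standard AutoModelForSequenceClassification,
# and the label order comes from the released config.id2label.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "DSL-13-SRMAP/IndicBERT_WR"  # hypothetical repo id; replace with the published one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "ఈ సినిమా చాలా బాగుంది"  # "This movie is very good"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])
```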
---

## Intended Use

- **Primary Use:** Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset as a **rationale-supervised baseline**, for comparison with models trained without rationales
- **Research Setting:** Well suited for monolingual Telugu NLP tasks, especially in low-resource and explainable AI research

---

## Why IndicBERT?

IndicBERT provides language-aware tokenization, embeddings tailored to Indian scripts, and faster training than larger multilingual models.  
It is well suited for monolingual Telugu tasks, but it does not support code-mixed or cross-lingual data. For Telugu sentiment classification, IndicBERT delivers efficient and accurate results thanks to its Indic-focused pretraining.  
With rationale supervision, the model can provide **explicit explanations** for its predictions.

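To make the idea of rationale supervision concrete, the sketch below shows one common recipe: a token-level rationale head trained jointly with the sentence-level sentiment head on top of the IndicBERT encoder. This is an illustrative assumption about the general approach, not the exact objective used to train IndicBERT_WR; see the paper for the actual setup.

```python
# Illustrative sketch only: one common way to add rationale supervision is to
# attach a token-level head next to the sentence-level classifier and train
# both jointly. The actual IndicBERT_WR objective may differ; see the paper.
import torch
import torch.nn as nn
from transformers import AutoModel

class RationaleSentimentModel(nn.Module):
    def __init__(self, backbone="ai4bharat/indicBERTv2-MLM-only", num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.sentiment_head = nn.Linear(hidden, num_labels)   # 3-way sentiment
        self.rationale_head = nn.Linear(hidden, 1)            # per-token rationale score

    def forward(self, input_ids, attention_mask, labels=None, rationale_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state                   # (batch, seq_len, hidden)
        sent_logits = self.sentiment_head(token_states[:, 0])  # [CLS]-style pooling
        tok_logits = self.rationale_head(token_states).squeeze(-1)

        loss = None
        if labels is not None and rationale_mask is not None:
            ce = nn.functional.cross_entropy(sent_logits, labels)
            bce = nn.functional.binary_cross_entropy_with_logits(
                tok_logits, rationale_mask.float(), reduction="none"
            )
            # Score only real (non-padding) tokens, then combine both objectives.
            bce = (bce * attention_mask).sum() / attention_mask.sum()
            loss = ce + bce
        return {"loss": loss, "sentiment_logits": sent_logits, "rationale_logits": tok_logits}
```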
---

## Performance and Limitations

**Strengths:**  
- Language-aware tokenization and embeddings for Telugu
- Faster training and inference compared to larger multilingual models
- Provides **explicit rationales** for predictions, aiding explainability
- Robust baseline for monolingual Telugu sentiment classification

**Limitations:**  
- Not suitable for code-mixed or cross-lingual tasks
- Telugu-specific models may outperform on highly nuanced or domain-specific data

---

## Training Data

- **Dataset:** [TeSent_Benchmark-Dataset](https://huggingface.co/datasets/dsl-13-srmap/tesent_benchmark-dataset)
- **Data Used:** The **Content** (Telugu sentence), **Label** (sentiment label), and **Rationale** (human-annotated rationale) columns are used for IndicBERT_WR training, as shown in the loading sketch below

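For reference, a short sketch of pulling the dataset with the 🤗 Datasets library and inspecting the three columns listed above; the split names and exact field types are not stated on this card, so verify them against the dataset page.

```python
# Sketch of inspecting the columns named above (Content, Label, Rationale).
# Split names and field types are assumptions; check the dataset page.
from datasets import load_dataset

ds = load_dataset("DSL-13-SRMAP/TeSent_Benchmark-Dataset")
print(ds)  # available splits and their columns

first_split = next(iter(ds.values()))
example = first_split[0]
print(example["Content"], example["Label"], example["Rationale"])
```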
---

## Language Coverage

- **Language:** Telugu (`te`)
- **Model Scope:** Strictly monolingual Telugu sentiment classification

---

## Citation and More Details

For detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, **please refer to our paper**.



---

## License

Released under [CC BY 4.0](LICENSE).