---
license: cc-by-4.0
tags:
- sentiment-classification
- telugu
- indicbert
- indian-languages
- baseline
language: te
datasets:
- DSL-13-SRMAP/TeSent_Benchmark-Dataset
model_name: IndicBERT_WR
---

# IndicBERT_WR: IndicBERT Telugu Sentiment Classification Model (With Rationale)

## Model Overview

**IndicBERT_WR** is a Telugu sentiment classification model based on **IndicBERT (ai4bharat/indicBERTv2-MLM-only)**, a multilingual BERT-style transformer developed by AI4Bharat.
The "WR" in the model name stands for "**With Rationale**": the model is trained on both sentiment labels and **human-annotated rationales** from the TeSent_Benchmark-Dataset.

---

## Model Details

- **Architecture:** IndicBERT (BERT-like, multilingual for Indian languages)
- **Pretraining Data:** OSCAR and AI4Bharat curated corpora for 12 Indian languages (including Telugu and English)
- **Pretraining Objective:** Masked Language Modeling (MLM)
- **Fine-tuning Data:** [TeSent_Benchmark-Dataset](https://huggingface.co/datasets/dsl-13-srmap/tesent_benchmark-dataset), using both sentence-level sentiment labels (positive, negative, neutral) and rationale annotations
- **Task:** Sentence-level sentiment classification (3-way)
- **Rationale Usage:** **Used** during training and/or inference ("WR" = With Rationale)
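
A minimal inference sketch follows. It assumes the checkpoint is released with a standard `transformers` sequence-classification head; the repository id and label order below are hypothetical placeholders, not confirmed by this card.

```python
# Minimal inference sketch for the fine-tuned classifier.
# Assumptions (not confirmed by this card): the repo id "DSL-13-SRMAP/IndicBERT_WR"
# is hypothetical, and the label order must be checked against model.config.id2label.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "DSL-13-SRMAP/IndicBERT_WR"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "ఈ సినిమా చాలా బాగుంది."  # Telugu: "This movie is very good."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

labels = ["negative", "neutral", "positive"]  # assumed order; verify via model.config.id2label
print(labels[logits.argmax(dim=-1).item()])
```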

---

## Intended Use

- **Primary Use:** Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset as a **baseline** for models trained with rationales
- **Research Setting:** Well suited to monolingual Telugu NLP work, especially low-resource and explainable-AI research

---

## Why IndicBERT?

IndicBERT provides language-aware tokenization, clean embeddings, and faster training for Indian languages.
It is well suited to monolingual Telugu tasks, but it does not support code-mixed or cross-lingual data. For Telugu sentiment classification, IndicBERT delivers efficient and accurate results thanks to its language-tailored pretraining.
With rationale supervision, the model can provide **explicit explanations** for its predictions.

---

## Performance and Limitations

**Strengths:**
- Language-aware tokenization and embeddings for Telugu
- Faster training and inference than larger multilingual models
- Provides **explicit rationales** for predictions, aiding explainability
- A robust baseline for monolingual Telugu sentiment classification

**Limitations:**
- Not suitable for code-mixed or cross-lingual tasks
- Telugu-specific models may outperform it on highly nuanced or domain-specific data

---

## Training Data

- **Dataset:** [TeSent_Benchmark-Dataset](https://huggingface.co/datasets/dsl-13-srmap/tesent_benchmark-dataset)
- **Data Used:** The **Content** (Telugu sentence), **Label** (sentiment label), and **Rationale** (human-annotated rationale) columns are used for IndicBERT_WR training (see the loading sketch below)
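
The card names these columns but does not state how the rationale enters the training objective. The sketch below reads the three columns under stated assumptions: the column names follow this card, while the split name and the joint-loss formulation in the closing comment are plausible conventions, not details confirmed here.

```python
# Sketch of reading the three columns named on this card (Content, Label, Rationale).
# Assumptions: the dataset exposes a "train" split with these exact column names;
# the joint objective sketched at the end is a common rationale-supervision design,
# not one confirmed by this card.
from datasets import load_dataset

ds = load_dataset("DSL-13-SRMAP/TeSent_Benchmark-Dataset", split="train")

for example in ds.select(range(3)):
    sentence = example["Content"]      # Telugu sentence
    label = example["Label"]           # positive / negative / neutral
    rationale = example["Rationale"]   # human-annotated rationale for the label
    print(label, "|", sentence, "|", rationale)

# A typical "with rationale" objective (hypothetical formulation) combines the
# sentiment loss with a weighted token-level rationale loss:
#   loss = cross_entropy(sentiment_logits, label)
#        + lambda_r * rationale_loss(token_scores, rationale_mask)
```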

---

## Language Coverage

- **Language:** Telugu (`te`)
- **Model Scope:** Strictly monolingual Telugu sentiment classification

---

## Citation and More Details

For detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, **please refer to our paper**.

---

## License

Released under [CC BY 4.0](LICENSE).