---
license: cc-by-4.0
tags:
  - sentiment-classification
  - telugu
  - indicbert
  - indian-languages
  - baseline
language: te
datasets:
  - DSL-13-SRMAP/TeSent_Benchmark-Dataset
model_name: IndicBERT_WOR
---

# IndicBERT_WOR: IndicBERT Telugu Sentiment Classification Model (Without Rationale)

## Model Overview

**IndicBERT_WOR** is a Telugu sentiment classification model based on **IndicBERT (ai4bharat/indicBERTv2-MLM-only)**, a multilingual BERT-like transformer developed by AI4Bharat. The "WOR" in the model name stands for "**Without Rationale**": the model is trained only with sentiment labels from the TeSent_Benchmark-Dataset and **does not use human-annotated rationales**. Minimal usage and data-preparation sketches are provided at the end of this card.

---

## Model Details

- **Architecture:** IndicBERT (BERT-like, multilingual for Indian languages)
- **Pretraining Data:** OSCAR and AI4Bharat curated corpora for 12 Indian languages (including Telugu and English)
- **Pretraining Objective:** Masked Language Modeling (MLM)
- **Fine-tuning Data:** [TeSent_Benchmark-Dataset](https://huggingface.co/datasets/dsl-13-srmap/tesent_benchmark-dataset), using only sentence-level sentiment labels (positive, negative, neutral); rationale annotations are disregarded
- **Task:** Sentence-level sentiment classification (3-way)
- **Rationale Usage:** **Not used** during training or inference ("WOR" = Without Rationale)

---

## Intended Use

- **Primary Use:** Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset as a **baseline** for models trained without rationales
- **Research Setting:** Well suited for monolingual Telugu NLP tasks, especially low-resource and explainable-AI research

---

## Why IndicBERT?

IndicBERT provides language-aware tokenization, clean embeddings, and faster training for Indian languages. It is well suited for monolingual Telugu tasks, but does not support code-mixed data or cross-lingual transfer. For Telugu sentiment classification, IndicBERT delivers efficient and accurate results due to its tailored pretraining.

---

## Performance and Limitations

**Strengths:**

- Language-aware tokenization and embeddings for Telugu
- Faster training and inference compared to larger multilingual models
- Robust baseline for monolingual Telugu sentiment classification

**Limitations:**

- Not suitable for code-mixed or cross-lingual tasks
- Telugu-specific models may outperform it on highly nuanced or domain-specific data
- Since rationales are not used, the model cannot provide explicit explanations for its predictions

---

## Training Data

- **Dataset:** [TeSent_Benchmark-Dataset](https://huggingface.co/datasets/dsl-13-srmap/tesent_benchmark-dataset)
- **Data Used:** Only the **Content** (Telugu sentence) and **Label** (sentiment label) columns; **rationale** annotations are ignored for IndicBERT_WOR training

---

## Language Coverage

- **Language:** Telugu (`te`)
- **Model Scope:** Strictly monolingual Telugu sentiment classification

---

## Citation and More Details

For the detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, **please refer to our paper**.

---

## License

Released under [CC BY 4.0](LICENSE).
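
---

## Example Usage (Sketch)

A minimal inference sketch for the 3-way sentiment task described above. The model repo id `DSL-13-SRMAP/IndicBERT_WOR` is an assumption based on this card's naming and is not confirmed here; substitute the actual Hub path, and check `model.config.id2label` for the released label mapping.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_ID = "DSL-13-SRMAP/IndicBERT_WOR"  # assumed Hub repo id; replace with the actual path

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

text = "ఈ సినిమా చాలా బాగుంది"  # example Telugu sentence: "This movie is very good"

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

pred_id = logits.argmax(dim=-1).item()
# The id-to-label mapping (positive / negative / neutral) comes from the
# fine-tuning configuration; inspect model.config.id2label on the checkpoint.
print(model.config.id2label.get(pred_id, str(pred_id)))
```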
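
---

## Fine-tuning Data Preparation (Sketch)

A sketch of the label-only ("WOR") data preparation described in the Training Data section: only the **Content** and **Label** columns are read, and rationale annotations are never touched. The split name `train` and the label-to-id encoding below are assumptions and should be checked against the released dataset.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Dataset and base-model ids as named on this card.
dataset = load_dataset("DSL-13-SRMAP/TeSent_Benchmark-Dataset")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indicBERTv2-MLM-only")

label2id = {"positive": 0, "negative": 1, "neutral": 2}  # assumed encoding of string labels

def preprocess(example):
    # Only the Content (sentence) and Label columns are used; rationale
    # columns, if present, are simply never read (the "WOR" setup).
    enc = tokenizer(example["Content"], truncation=True, max_length=128)
    enc["labels"] = label2id[example["Label"]]
    return enc

train_data = dataset["train"].map(
    preprocess,
    remove_columns=dataset["train"].column_names,
)
```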