---
license: cc-by-4.0
tags:
  - sentiment-classification
  - telugu
  - muril
  - indian-languages
  - baseline
  - tesent
language: te
datasets:
  - DSL-13-SRMAP/TeSent_Benchmark-Dataset
model_name: MuRIL_WOR
---

# MuRIL_WOR: MuRIL Telugu Sentiment Classification Model (Without Rationale)

## Model Overview

MuRIL_WOR is a Telugu sentiment classification model based on MuRIL (Multilingual Representations for Indian Languages), a BERT-based transformer pretrained on 17 languages, including Telugu and English.
"WOR" stands for "Without Rationale": the model is trained only on sentence-level sentiment labels from the TeSent_Benchmark-Dataset and does not use the dataset's human-annotated rationales.


## Model Details

- Architecture: MuRIL (BERT-base for Indian languages, multilingual)
- Pretraining Data: a large corpus of Telugu text drawn from the web, religious texts, news data, and other sources
- Pretraining Objectives: Masked Language Modeling (MLM) and Translation Language Modeling (TLM)
- Fine-tuning Data: TeSent_Benchmark-Dataset, using only sentence-level sentiment labels (positive, negative, neutral); rationale annotations are disregarded
- Task: Sentence-level 3-way sentiment classification (see the setup sketch after this list)
- Rationale Usage: Not used during training or inference ("WOR" = Without Rationale)
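For orientation, the setup described above can be reproduced roughly as follows. This is a sketch assuming the public `google/muril-base-cased` checkpoint; the label mapping is illustrative, not the exact configuration used for MuRIL_WOR.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public MuRIL base checkpoint; MuRIL_WOR adds a freshly initialized
# 3-way classification head on top of this encoder.
base = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=3,  # positive / negative / neutral
    id2label={0: "negative", 1: "neutral", 2: "positive"},  # assumed mapping
    label2id={"negative": 0, "neutral": 1, "positive": 2},
)

# Standard fine-tuning updates all parameters, encoder and head alike,
# with a cross-entropy loss over the three labels.
enc = tokenizer("ఈ పుస్తకం బాగోలేదు", return_tensors="pt")  # "This book is not good"
print(model(**enc).logits.shape)  # torch.Size([1, 3])
```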

## Intended Use

- Primary Use: Benchmarking Telugu sentiment classification on the TeSent_Benchmark-Dataset, especially as a baseline for models trained without rationales (a minimal scoring sketch follows below)
- Research Setting: Recommended for academic research in low-resource NLP, especially on informal, social-media, or conversational Telugu data
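As a baseline, predictions are typically scored with accuracy and macro-averaged F1 over the three classes. The exact evaluation protocol is defined in the paper; the sketch below only illustrates one common way to score 3-way sentiment output, with toy data.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions (0 = negative, 1 = neutral, 2 = positive).
y_true = [0, 1, 2, 2, 0, 1]
y_pred = [0, 1, 2, 1, 0, 1]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```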

## Why MuRIL?

MuRIL is pretrained specifically on Indian languages and captures Telugu morphology and syntax better than generic multilingual models such as mBERT or XLM-R.
Because its pretraining corpus leans toward informal web text, it is especially effective for informal, social-media, or conversational Telugu tasks; performance may be lower on formal or classical Telugu.


## Performance and Limitations

Strengths:

- Stronger grasp of Telugu than general-purpose multilingual models
- Excels on informal, web, or conversational Telugu sentiment tasks
- A robust baseline for Telugu sentiment classification

Limitations:

- May underperform on formal or classical Telugu due to its pretraining corpus
- Scope is limited to Telugu sentiment analysis; not suited to other languages or to highly formal text
- Because rationales are not used, the model cannot provide explicit explanations for its predictions

## Training Data

- Dataset: TeSent_Benchmark-Dataset
- Data Used: only the Content (Telugu sentence) and Label (sentiment label) columns; rationale annotations are ignored for MuRIL_WOR training (see the preparation sketch below)
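A data-preparation sketch for the "WOR" setting, assuming the dataset loads via `datasets.load_dataset` and exposes the Content and Label columns described above (the split name and any additional rationale columns are assumptions):

```python
from datasets import load_dataset

# Load the benchmark; the split name "train" is an assumption.
ds = load_dataset("DSL-13-SRMAP/TeSent_Benchmark-Dataset", split="train")

# Keep only the sentence and its sentiment label, discarding any
# rationale annotations -- the defining choice of the "WOR" baseline.
keep = {"Content", "Label"}
ds = ds.remove_columns([c for c in ds.column_names if c not in keep])
print(ds.column_names)  # ['Content', 'Label']
```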

## Language Coverage

- Language: Telugu (te)
- Model Scope: Strictly monolingual Telugu sentiment classification

## Citation and More Details

For detailed experimental setup, evaluation metrics, and comparisons with rationale-based models, please refer to our paper.


## License

Released under the CC BY 4.0 license.