Model Overview

This model is a fine-tuned version of AmaanP314/youtube-xlm-roberta-base-sentiment-multilingual. While the base model was trained on multilingual YouTube comments, this version has been fine-tuned on a large dataset of Telugu comments, enabling it to classify Telugu (native script), transliterated Telugu, and English YouTube comments into three sentiment categories: Negative, Neutral, and Positive.

Model Details

Base model: AmaanP314/youtube-xlm-roberta-base-sentiment-multilingual

Fine-tuned for: Telugu + English YouTube comment sentiment analysis

Languages Supported:

Telugu (native script)

Transliterated Telugu

English

Labels:

0: Negative

1: Neutral

2: Positive

Dataset & Labeling

Source: Comments were extracted from YouTube using the YouTube Data API.

Comment Count:

Train set: 73,943 comments

Validation set: 8,216 comments

Labeling Method: Comments were auto-labeled with Gemini 1.5 Pro (Google's LLM) using a sentiment classification prompt that assigns one of the three sentiment classes (see the sketch below).
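
A minimal sketch of how such prompt-based auto-labeling could look with the google-generativeai SDK is shown below. The prompt wording, the label_comment helper, and the fallback behavior are illustrative assumptions; the author's actual labeling prompt and pipeline are not published.

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
llm = genai.GenerativeModel("gemini-1.5-pro")

PROMPT = (
    "Classify the sentiment of the following YouTube comment as exactly one of: "
    "Negative, Neutral, or Positive. Reply with the label only.\n\n"
    "Comment: {comment}"
)

def label_comment(comment: str) -> str:
    # One call per comment; batching, rate limiting, and retries omitted for brevity.
    response = llm.generate_content(PROMPT.format(comment=comment))
    label = response.text.strip()
    # Assumed fallback: treat unparseable replies as Neutral.
    return label if label in {"Negative", "Neutral", "Positive"} else "Neutral"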

How to Use

The model can be used via an API endpoint or loaded locally using the Hugging Face Transformers library. For example, using Python:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "gajula21/youtube-sentiment-model-telugu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

comments = [
    "ఈ సినిమా చాలా బాగుంది!",  # "This movie is very good!"
    "ఈ వీడియో చాలా బోరు పడింది",  # "This video was very boring"
    "ఇది మామూలు వీడియో",  # "This is an ordinary video"
]

# Tokenize as a padded batch; long comments are truncated to the model's max length.
inputs = tokenizer(comments, return_tensors="pt", padding=True, truncation=True)

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring class per comment and map indices to label names.
predictions = torch.argmax(outputs.logits, dim=1)
label_mapping = {0: "Negative", 1: "Neutral", 2: "Positive"}
sentiments = [label_mapping[p.item()] for p in predictions]

print(sentiments)
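
For quick experiments, the same checkpoint can also be wrapped in a Transformers pipeline (the transliterated example input below is illustrative). Depending on the checkpoint's config, labels may be reported as Negative/Neutral/Positive or as generic LABEL_0/LABEL_1/LABEL_2, where indices 0/1/2 follow the mapping above.

from transformers import pipeline

classifier = pipeline("text-classification", model="gajula21/youtube-sentiment-model-telugu")

# Transliterated Telugu: "This video is very good"
print(classifier("Ee video chala bagundi"))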

Training Configuration

Framework: Hugging Face Transformers (PyTorch)

Tokenizer: AutoTokenizer from base model

Loss Function: CrossEntropyLoss with label_smoothing=0.1

Batch Size: 1176 (per device)

Gradient Accumulation Steps: 2

Learning Rate: 1e-5

Weight Decay: 0.05

Epochs: 3

Evaluation Strategy: Every 125 steps

Early Stopping: Patience of 5 evaluations

Mixed Precision: Enabled (fp16)
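
The configuration above maps to the Trainer API roughly as follows. This is a sketch under stated assumptions (output_dir, train_ds, and val_ds are placeholders for the author's unpublished script), not the actual training code.

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="telugu-sentiment",     # placeholder
    per_device_train_batch_size=1176,  # as listed above
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    weight_decay=0.05,
    num_train_epochs=3,
    eval_strategy="steps",             # named evaluation_strategy in older releases
    eval_steps=125,
    save_strategy="steps",
    save_steps=125,
    load_best_model_at_end=True,       # required for early stopping
    label_smoothing_factor=0.1,        # CrossEntropyLoss with label_smoothing=0.1
    fp16=True,
)

trainer = Trainer(
    model=model,                       # the sequence classification model
    args=args,
    train_dataset=train_ds,            # tokenized train split (placeholder)
    eval_dataset=val_ds,               # tokenized validation split (placeholder)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()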

Evaluation Results

Step  Training Loss  Validation Loss  Accuracy
 125         0.7637           0.7355    72.97%
 250         0.7289           0.7110    74.57%
 375         0.7155           0.6982    75.72%
 500         0.6912           0.7005    75.58%
 625         0.6851           0.6821    76.79%
 750         0.6606           0.6897    76.61%
 875         0.6464           0.6838    76.68%
1000         0.6542           0.6676    77.45%
1125         0.6501           0.6602    78.04%
1250         0.6374           0.6730    77.81%
1375         0.6143           0.6682    77.99%
1500         0.6175           0.6665    78.10%
1625         0.6183           0.6646    78.16%
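
The accuracy column could be produced by a compute_metrics hook along these lines (an illustrative sketch assuming scikit-learn; the card does not include the actual metric code):

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # Trainer passes (logits, labels) at each evaluation round.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}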

Citation

@misc{gajula21_youtube_sentiment_2025,
  author = {Gajula Vivek},
  title = {Telugu-English YouTube Sentiment Classifier},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/gajula21/youtube-sentiment-model-telugu}},
}