Model Overview
This model is a fine-tuned version of AmaanP314/youtube-xlm-roberta-base-sentiment-multilingual. While the base model was trained on multilingual YouTube comments, this version has been further fine-tuned on a large dataset of Telugu comments, enabling it to classify Telugu (native script), transliterated Telugu, and English YouTube comments into three sentiment categories: Negative, Neutral, and Positive.
Model Details
Base model: AmaanP314/youtube-xlm-roberta-base-sentiment-multilingual
Fine-tuned for: Telugu + English YouTube comment sentiment analysis
Languages Supported:
Telugu (native script)
Transliterated Telugu
English
Labels:
0: Negative
1: Neutral
2: Positive
Dataset & Labeling
Source: Comments were extracted from YouTube using the YouTube Data API.
Comment Count:
Train set: 73,943 comments
Validation set: 8,216 comments
Labeling Method: Comments were auto-labeled with Gemini 1.5 Pro (Google's LLM), using a sentiment classification prompt that assigns each comment one of the three sentiment classes.
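The exact labeling prompt is not published. The following is a minimal sketch of what such LLM-based auto-labeling could look like with the google-generativeai Python client; the prompt wording, the label_comment helper, and the Neutral fallback are illustrative assumptions, not the pipeline used for this dataset.

```python
# Hypothetical sketch of LLM-based auto-labeling (not the author's exact pipeline)
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: key supplied by the caller
llm = genai.GenerativeModel("gemini-1.5-pro")

PROMPT_TEMPLATE = (
    "Classify the sentiment of the following YouTube comment as exactly one of "
    "Negative, Neutral, or Positive. Reply with the single word only.\n\n"
    "Comment: {comment}"
)

def label_comment(comment: str) -> str:
    response = llm.generate_content(PROMPT_TEMPLATE.format(comment=comment))
    answer = response.text.strip()
    # Fall back to Neutral if the model replies with anything unexpected
    return answer if answer in {"Negative", "Neutral", "Positive"} else "Neutral"

print(label_comment("ఈ సినిమా చాలా బాగుంది!"))  # expected: Positive
```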
How to Use
The model can be used via an API endpoint or loaded locally using the Hugging Face Transformers library. For example, using Python:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "gajula21/youtube-sentiment-model-telugu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

comments = [
    "ఈ సినిమా చాలా బాగుంది!",     # "This movie is very good!"
    "ఈ వీడియో చాలా బోరు పడింది",  # "This video was very boring"
    "ఇది మామూలు వీడియో",          # "This is an ordinary video"
]

# Tokenize the comments as one padded, truncated batch
inputs = tokenizer(comments, return_tensors="pt", padding=True, truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring class for each comment
predictions = torch.argmax(outputs.logits, dim=1)

label_mapping = {0: "Negative", 1: "Neutral", 2: "Positive"}
sentiments = [label_mapping[p.item()] for p in predictions]
print(sentiments)
```
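Alternatively, the same model can be run through the Transformers pipeline API, which bundles tokenization, inference, and label mapping into one call. The label names it reports come from the model config's id2label mapping; if that mapping is absent from the hosted config, you may see generic LABEL_0/LABEL_1/LABEL_2 ids corresponding to Negative/Neutral/Positive.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="gajula21/youtube-sentiment-model-telugu")
print(classifier("Ee cinema chala bagundi!"))  # transliterated Telugu: "This movie is very good!"
```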
Training Configuration
Framework: Hugging Face Transformers (PyTorch)
Tokenizer: AutoTokenizer from base model
Loss Function: CrossEntropyLoss with label_smoothing=0.1
Batch Size: 1176 (per device)
Gradient Accumulation Steps: 2
Learning Rate: 1e-5
Weight Decay: 0.05
Epochs: 3
Evaluation Strategy: Every 125 steps
Early Stopping: Patience of 5 evaluations without improvement
Mixed Precision: Enabled (fp16)
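For reference, these settings map onto Hugging Face TrainingArguments roughly as shown below. This is a reconstruction from the list above rather than the author's actual training script; dataset loading and tokenization are omitted, and train_ds / val_ds are hypothetical placeholders for the tokenized splits.

```python
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "AmaanP314/youtube-xlm-roberta-base-sentiment-multilingual", num_labels=3
)

args = TrainingArguments(
    output_dir="telugu-sentiment",
    per_device_train_batch_size=1176,   # as listed above
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    weight_decay=0.05,
    num_train_epochs=3,
    evaluation_strategy="steps",        # renamed to `eval_strategy` in newer versions
    eval_steps=125,
    save_strategy="steps",
    save_steps=125,
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="accuracy",
    label_smoothing_factor=0.1,         # cross-entropy loss with label smoothing
    fp16=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,   # hypothetical tokenized train split
    eval_dataset=val_ds,      # hypothetical tokenized validation split
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```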
Evaluation Results
| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 125  | 0.7637 | 0.7355 | 72.97% |
| 250  | 0.7289 | 0.7110 | 74.57% |
| 375  | 0.7155 | 0.6982 | 75.72% |
| 500  | 0.6912 | 0.7005 | 75.58% |
| 625  | 0.6851 | 0.6821 | 76.79% |
| 750  | 0.6606 | 0.6897 | 76.61% |
| 875  | 0.6464 | 0.6838 | 76.68% |
| 1000 | 0.6542 | 0.6676 | 77.45% |
| 1125 | 0.6501 | 0.6602 | 78.04% |
| 1250 | 0.6374 | 0.6730 | 77.81% |
| 1375 | 0.6143 | 0.6682 | 77.99% |
| 1500 | 0.6175 | 0.6665 | 78.10% |
| 1625 | 0.6183 | 0.6646 | 78.16% |
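Accuracy here is plain argmax accuracy on the validation split. To score the model on your own labeled comments with the same metric, a small batched helper like the sketch below suffices; the texts and gold_labels arguments are placeholders for your data, using the 0/1/2 label scheme above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "gajula21/youtube-sentiment-model-telugu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def accuracy_on(texts, gold_labels, batch_size=32):
    """Argmax accuracy over a list of comments and integer labels (0/1/2)."""
    correct = 0
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                        padding=True, truncation=True)
        with torch.no_grad():
            preds = model(**enc).logits.argmax(dim=1).tolist()
        correct += sum(p == g for p, g in zip(preds, gold_labels[i:i + batch_size]))
    return correct / len(texts)
```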
Citation
```bibtex
@misc{gajula21_youtube_sentiment_2025,
  author       = {Gajula Vivek},
  title        = {Telugu-English YouTube Sentiment Classifier},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/gajula21/youtube-sentiment-model-telugu}},
}
```