# Urdu RoBERTa Hate Speech Classifier (Balanced)

- Base model: `urduhack/roberta-urdu-small`
- Task: Binary text classification (hate vs. not_hate)
- Language: Urdu (ur)
- Labels:
  - 0 → `not_hate`
  - 1 → `hate`
This model fine-tunes a small RoBERTa for Urdu hate-speech detection. Class imbalance was addressed by oversampling with SMOTE at the feature level (TF–IDF) prior to tokenization-based training.
## Training data and preprocessing

- Source dataset: `Adnan855570/urdu-hate-speech` (Excel files: `preprocessed_combined_file (1).xlsx`, `Urdu_Hate_Speech.xlsx`)
- Columns used in the notebook: `Tweet` (text), `Tag` (label in {0, 1})
- Steps (sketched in code after this list):
  - TF–IDF featurization (`max_features=10000`)
  - SMOTE oversampling (`random_state=42`) to balance classes
  - Train/test split: 80/20 (`random_state=42`)
  - Tokenization: `AutoTokenizer.from_pretrained("urduhack/roberta-urdu-small")` with `truncation=True`, `padding=True`
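The steps above correspond roughly to the following sketch. Only the column names, `max_features`, and seeds come from this card; the file path and variable names are illustrative assumptions, not the exact notebook cells.

```python
# Sketch of the preprocessing described above; paths and variable names
# are illustrative assumptions, not the exact notebook code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from transformers import AutoTokenizer

df = pd.read_excel("Urdu_Hate_Speech.xlsx")  # columns: Tweet, Tag
texts, labels = df["Tweet"].astype(str), df["Tag"].astype(int)

# TF-IDF featurization (max_features=10000)
tfidf = TfidfVectorizer(max_features=10000)
X = tfidf.fit_transform(texts)

# SMOTE oversampling to balance the two classes
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, labels)

# 80/20 train/test split with the same seed
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=42
)

# Tokenization for the transformer fine-tuning path
tokenizer = AutoTokenizer.from_pretrained("urduhack/roberta-urdu-small")
enc = tokenizer(texts.tolist(), truncation=True, padding=True)
```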
## Training setup

- Model: `AutoModelForSequenceClassification` with `num_labels=2`
- Device: GPU if available
- Hyperparameters (see the `TrainingArguments` sketch after this list):
  - epochs: 3
  - per_device_train_batch_size: 8
  - per_device_eval_batch_size: 8
  - warmup_steps: 500
  - weight_decay: 0.01
  - evaluation_strategy: epoch
  - save_strategy: epoch
  - load_best_model_at_end: true
- Metrics: Accuracy, Precision, Recall, F1 (binary)
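The hyperparameters above map onto the 🤗 `Trainer` API roughly as follows. This is a sketch, not the exact notebook cells: `output_dir`, `train_ds`, and `eval_ds` are placeholder names, and dataset construction is omitted.

```python
# Sketch of the training setup above using the Trainer API; dataset
# construction is omitted and train_ds / eval_ds are placeholders.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "urduhack/roberta-urdu-small", num_labels=2
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

args = TrainingArguments(
    output_dir="outputs",               # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,             # placeholder: tokenized training split
    eval_dataset=eval_ds,               # placeholder: tokenized test split
    compute_metrics=compute_metrics,
)
trainer.train()
```

The `Trainer` moves the model to a GPU automatically when one is available, which matches the device note above.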
## Evaluation results (test split)
- accuracy: 0.7891
- f1: 0.7854
- precision: 0.8208
- recall: 0.7529
Note: Results derive from the balanced (SMOTE) dataset and the 80/20 split used in the notebook.
## How to use (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_ID = "Adnan855570/urdu-roberta-hate"  # replace if different

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

# transformers normalizes id2label keys to ints after loading the config
id2label = model.config.id2label or {0: "not_hate", 1: "hate"}

def predict(text: str):
    enc = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**enc).logits
    probs = logits.softmax(dim=-1).squeeze().tolist()
    pred = int(logits.argmax(dim=-1).item())
    return {"label_id": pred, "label": id2label.get(pred, str(pred)),
            "scores": {"not_hate": probs[0], "hate": probs[1]}}

print(predict("یہ نفرت انگیز ہے یا نہیں؟"))  # "Is this hateful or not?"
```
Or with a pipeline:
```python
from transformers import pipeline

clf = pipeline("text-classification", model="Adnan855570/urdu-roberta-hate", top_k=None)
print(clf("یہ نفرت انگیز ہے یا نہیں؟"))  # "Is this hateful or not?"
```
## Inference API
- cURL

```bash
curl -X POST -H "Authorization: Bearer $HF_TOKEN" -H "Content-Type: application/json" \
  -d '{"inputs":"یہ نفرت انگیز ہے یا نہیں؟"}' \
  https://api-inference.huggingface.co/models/Adnan855570/urdu-roberta-hate
```
- Python

```python
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/Adnan855570/urdu-roberta-hate"
HEADERS = {"Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}"}

print(requests.post(API_URL, headers=HEADERS, json={"inputs": "..."}, timeout=30).json())
```
## Intended uses and limitations

- Intended:
  - Flagging potentially hateful Urdu content
  - Assisting human moderation and research
- Limitations:
  - May misclassify satire, reclaimed slurs, or code-mixed content
  - Sensitive to domain shift across platforms, communities, and topics
- Risks:
  - False positives/negatives; do not use as the sole basis for punitive actions
- Recommendation:
  - Use with a human in the loop; periodically audit outcomes and bias
## Label mapping

Ensure the config includes:

```python
id2label = {"0": "not_hate", "1": "hate"}
label2id = {"not_hate": 0, "hate": 1}
```
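If the mapping is missing, it can be attached to the model config before saving or pushing. A minimal sketch; the output directory name is an assumption:

```python
# Minimal sketch: attach the label mapping to the config before saving.
# The directory name "urdu-roberta-hate" is an illustrative assumption.
model.config.id2label = {0: "not_hate", 1: "hate"}
model.config.label2id = {"not_hate": 0, "hate": 1}
model.save_pretrained("urdu-roberta-hate")
tokenizer.save_pretrained("urdu-roberta-hate")
```

In the serialized `config.json`, the integer keys of `id2label` appear as strings, matching the mapping shown above.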
## Reproducibility notes

- SMOTE and split seeds: `random_state=42`
- Tokenization: truncation and padding enabled (no explicit `max_length` set in the notebook)
- Hardware: single GPU (e.g., Colab)
## License

- This derivative model should comply with the base model's license (`urduhack/roberta-urdu-small`). Set a compatible license here once confirmed.
## Citation

```bibtex
@misc{urdu_roberta_hate_balanced_2025,
  title        = {Urdu RoBERTa Hate Speech Classifier (Balanced)},
  author       = {Adnan},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Adnan855570/urdu-roberta-hate}}
}
```
## Acknowledgements

- Base model: `urduhack/roberta-urdu-small`
- Libraries: 🤗 Transformers, Datasets, PyTorch
- Oversampling: SMOTE (`imblearn`)