---
language: en
license: mit
tags:
- text-classification
- bot-detection
- social-media
- distilroberta
- pytorch
- transformers
datasets:
- custom
widget:
- text: >-
    🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here:
    bit.ly/deal123
  example_title: Promotional Bot Text
- text: >-
    Just finished reading an interesting article about machine learning
    applications in healthcare.
  example_title: Human-like Text
- text: Follow for follow? Like my posts and I'll like yours back! 💯
  example_title: Social Media Bot
- text: Had a wonderful dinner with my family tonight. These moments are precious.
  example_title: Authentic Human Text
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: distilroberta-bot-detection
  results:
  - task:
      type: text-classification
      name: Bot Detection
    metrics:
    - type: accuracy
      value: 0.9423
      name: Test Accuracy
    - type: f1
      value: 0.9424
      name: Test F1-Score (Weighted)
    - type: precision
      value: 0.9428
      name: Test Precision (Weighted)
    - type: recall
      value: 0.9423
      name: Test Recall (Weighted)
---
# Bot Detection Model - DistilRoBERTa

## Model Description
This model is a DistilRoBERTa-base checkpoint fine-tuned for binary classification of social media text, distinguishing human-authored from bot-generated content. Training used a class-weighted loss to handle dataset imbalance, and performance was validated with 5-fold cross-validation.
## Performance

### Cross-Validation Results (5-Fold)

| Metric | Mean ± Std | Range |
|---|---|---|
| Accuracy | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
| F1-Score (Weighted) | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
| Precision (Weighted) | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |
### Test Set Performance
- Accuracy: 0.9423
- F1-Score (Weighted): 0.9424
- Precision (Weighted): 0.9428
- Recall (Weighted): 0.9423
- Inference Speed: 232.83 samples/second
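All averaged metrics above are support-weighted across the human and bot classes. For reference, a minimal sketch of how weighted scores are computed with scikit-learn; `y_true` and `y_pred` are illustrative placeholders, not the model's actual test outputs:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 1, 0, 1, 1, 0, 0]  # 0 = human, 1 = bot (placeholder labels)
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" averages per-class scores weighted by class support
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"acc={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```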
## Usage

### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

# Load model and tokenizer
model_name = "junaid1993/distilroberta-bot-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def preprocess_text(text):
    """Clean text for bot detection."""
    if not isinstance(text, str):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove @ and # symbols
    text = re.sub(r'[@#]', '', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

def predict_bot(text, threshold=0.5):
    """Predict whether text is bot-generated."""
    clean_text = preprocess_text(text)
    if not clean_text:
        return {"prediction": "unknown", "confidence": 0.5}
    inputs = tokenizer(
        clean_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    bot_prob = probabilities[0][1].item()
    prediction = "bot" if bot_prob > threshold else "human"
    return {
        "prediction": prediction,
        "bot_probability": round(bot_prob, 4),
        "human_probability": round(probabilities[0][0].item(), 4),
    }

# Example usage
text = "🔥 AMAZING DEAL! Click here now!"
result = predict_bot(text)
print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
```
## Training Details

### Model Architecture
- Base Model: distilroberta-base
- Task: Binary sequence classification
- Classes: Human (0) vs Bot (1)
- Parameters: ~82M parameters
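These figures can be verified directly from the checkpoint; a quick sanity check:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "junaid1993/distilroberta-bot-detection"
)
print(model.config.num_labels)                     # 2: human (0) vs bot (1)
print(sum(p.numel() for p in model.parameters()))  # roughly 82M parameters
```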
### Training Configuration
- Epochs: 10 (with early stopping)
- Batch Size: 2 per device with 8 gradient accumulation steps (effective batch size 16)
- Learning Rate: Automatic (AdamW optimizer)
- Weight Decay: 0.01
- Mixed Precision: FP16
- Class Weighting: Applied to handle dataset imbalance (see the weighted-loss sketch below)
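The stock `Trainer` does not accept class weights directly, so a common pattern is to subclass it and override `compute_loss`. A minimal sketch under assumed names — `WeightedTrainer` and the example weight values are illustrative, not the exact training code used for this model:

```python
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Trainer variant that applies per-class weights to the cross-entropy loss."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # e.g. torch.tensor([0.8, 1.2])

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Weighted cross-entropy: minority-class errors cost more
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(logits.device)
        )
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```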
### Data Preprocessing
- URL removal
- Special character cleaning (@ symbols, hashtags)
- Punctuation removal
- Number removal
- Whitespace normalization
- Lowercase conversion
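As a concrete illustration, here is what these steps produce for a made-up promotional post, using the `preprocess_text` helper defined in the Quick Start:

```python
# preprocess_text is defined in the Quick Start section above.
raw = "Check this out! 🔥 90% OFF at http://bit.ly/deal123 #sale @everyone"
print(preprocess_text(raw))
# -> "check this out off at sale everyone"
```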
### Validation Methodology
- Cross-Validation: 5-fold Stratified K-Fold
- Test Split: 20% holdout set
- Metrics: Accuracy, Precision, Recall, F1-score (both weighted and macro)
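A minimal sketch of the stratified split with scikit-learn; the toy `texts` and `labels` arrays are placeholders for the actual dataset:

```python
from sklearn.model_selection import StratifiedKFold

texts = [f"post {i}" for i in range(10)]  # placeholder documents
labels = [0, 1] * 5                       # placeholder human/bot labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
    # Each fold preserves the human/bot ratio in both partitions.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```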
## Limitations
- Domain: Primarily trained on social media text patterns
- Language: English text only
- Temporal: Bot patterns may evolve over time, requiring retraining
- Context: Performance may vary with text length and complexity
## Intended Use
This model is designed for:
- Social media content moderation
- Academic research on bot detection
- Content analysis and verification
## Ethical Considerations
- This model should be used responsibly and not for harassment
- Results should be interpreted with appropriate confidence thresholds
- Human oversight is recommended for critical decisions
- Regular model updates may be needed as bot techniques evolve
## Citation

```bibtex
@misc{distilroberta-bot-detection-2025,
  title={Bot Detection Model using DistilRoBERTa},
  author={Junaid Ahmed and Dariusz Jemielniak and Leon Ciechanowski},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
}
```
## License
MIT License
- **Model Card Created**: 2025-08-23
- **Framework**: PyTorch + Transformers
- **Validation**: 5-Fold Cross-Validation with Class Weighting