---
language: en
license: mit
tags:
- text-classification
- bot-detection
- social-media
- distilroberta
- pytorch
- transformers
datasets:
- custom
widget:
- text: "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
  example_title: "Promotional Bot Text"
- text: "Just finished reading an interesting article about machine learning applications in healthcare."
  example_title: "Human-like Text"
- text: "Follow for follow? Like my posts and I'll like yours back! 💯"
  example_title: "Social Media Bot"
- text: "Had a wonderful dinner with my family tonight. These moments are precious."
  example_title: "Authentic Human Text"
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: distilroberta-bot-detection
  results:
  - task:
      type: text-classification
      name: Bot Detection
    metrics:
    - type: accuracy
      value: 0.9423
      name: Test Accuracy
    - type: f1
      value: 0.9424
      name: Test F1-Score (Weighted)
    - type: precision
      value: 0.9428
      name: Test Precision (Weighted)
    - type: recall
      value: 0.9423
      name: Test Recall (Weighted)
---
# Bot Detection Model - DistilRoBERTa
## Model Description
This model fine-tunes DistilRoBERTa-base for binary classification of social media text, distinguishing human-authored from bot-generated content. Training uses class weighting to compensate for dataset imbalance, and performance was validated with 5-fold cross-validation.
## Performance
### Cross-Validation Results (5-Fold)
| Metric | Mean ± Std | Range |
|--------|------------|-------|
| **Accuracy** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
| **F1-Score (Weighted)** | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
| **Precision (Weighted)** | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |
### Test Set Performance
- **Accuracy**: 0.9423
- **F1-Score (Weighted)**: 0.9424
- **Precision (Weighted)**: 0.9428
- **Recall (Weighted)**: 0.9423
- **Inference Speed**: 232.83 samples/second
## Usage
### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re
# Load model and tokenizer
model_name = "junaid1993/distilroberta-bot-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
def preprocess_text(text):
    """Clean text for bot detection."""
    if not isinstance(text, str):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove @ and # symbols
    text = re.sub(r'[@#]', '', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Clean whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

def predict_bot(text, threshold=0.5):
    """Predict whether text is bot-generated."""
    clean_text = preprocess_text(text)
    if not clean_text:
        return {"prediction": "unknown", "confidence": 0.5}
    inputs = tokenizer(
        clean_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    bot_prob = probabilities[0][1].item()  # index 1 = bot class
    prediction = "bot" if bot_prob > threshold else "human"
    return {
        "prediction": prediction,
        "bot_probability": round(bot_prob, 4),
        "human_probability": round(probabilities[0][0].item(), 4),
    }

# Example usage
text = "🔥 AMAZING DEAL! Click here now!"
result = predict_bot(text)
print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
```
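### Pipeline API

For quick experiments, the model can also be loaded through the `pipeline` API. Note that `pipeline` does not apply the custom preprocessing above, and the returned label names depend on the `id2label` mapping stored in the model config (they may appear as `LABEL_0`/`LABEL_1`):
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="junaid1993/distilroberta-bot-detection",
)
print(classifier("Follow for follow? Like my posts and I'll like yours back!"))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}] -- label names and score will vary
```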
## Training Details
### Model Architecture
- **Base Model**: distilroberta-base
- **Task**: Binary sequence classification
- **Classes**: Human (0) vs Bot (1)
- **Parameters**: ~82M (see the check below)
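A quick way to verify the parameter count locally:
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "junaid1993/distilroberta-bot-detection"
)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # ~82M for distilroberta-base + classifier head
```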
### Training Configuration
- **Epochs**: 10 (with early stopping)
- **Batch Size**: 2 per device, gradient accumulation steps: 8
- **Learning Rate**: Optimizer default (AdamW)
- **Weight Decay**: 0.01
- **Mixed Precision**: FP16
- **Class Weighting**: Applied to handle dataset imbalance (see the training sketch below)
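A minimal sketch of how this configuration could be expressed with the `Trainer` API. The `WeightedTrainer` subclass, the placeholder class weights, and the evaluation/early-stopping settings are illustrative assumptions; the card does not publish the actual training script:
```python
import torch
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

# Placeholder weights; the real values depend on the (unpublished) class balance
CLASS_WEIGHTS = torch.tensor([1.0, 1.5])

class WeightedTrainer(Trainer):
    """Trainer variant that applies class weights in the cross-entropy loss."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=CLASS_WEIGHTS.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

training_args = TrainingArguments(
    output_dir="distilroberta-bot-detection",
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    weight_decay=0.01,               # learning rate left at the Trainer default
    fp16=True,
    eval_strategy="epoch",           # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
    metric_for_best_model="f1",      # needs a compute_metrics returning an "f1" key
)

# trainer = WeightedTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,   # hypothetical tokenized datasets
#     eval_dataset=eval_dataset,
#     callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
# )
# trainer.train()
```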
### Data Preprocessing
1. URL removal
2. Special character cleaning (@ symbols, hashtags)
3. Punctuation removal
4. Number removal
5. Whitespace normalization
6. Lowercase conversion
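Running the `preprocess_text` helper from the Quick Start over the first widget example shows the combined effect of these steps:
```python
raw = "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
print(preprocess_text(raw))
# amazing deal get off now limited time only click here bitlydeal
```
Note that the URL pattern only matches `http`/`www` prefixes, so bare short links such as `bit.ly/...` are not removed; they survive as fused tokens after punctuation and digit stripping.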
### Validation Methodology
- **Cross-Validation**: 5-fold Stratified K-Fold
- **Test Split**: 20% holdout set
- **Metrics**: Accuracy, Precision, Recall, F1-score (both weighted and macro)
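A sketch of this evaluation protocol with scikit-learn, using a dummy corpus for illustration; the actual data, split seed, and fold assignments are not published:
```python
from sklearn.model_selection import StratifiedKFold, train_test_split

texts = ["buy now!!!", "nice walk today"] * 50   # dummy corpus for illustration
labels = [1, 0] * 50                             # 1 = bot, 0 = human

# 20% stratified holdout test set
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 5-fold stratified cross-validation on the remaining 80%
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(train_texts, train_labels)):
    print(f"fold {fold}: {len(tr_idx)} train / {len(va_idx)} validation samples")
    # each fold would fine-tune one model and record accuracy,
    # precision, recall, and F1 (weighted and macro)
```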
## Limitations
- **Domain**: Primarily trained on social media text patterns
- **Language**: English text only
- **Temporal**: Bot patterns may evolve over time, requiring retraining
- **Context**: Performance may vary with text length and complexity
## Intended Use
This model is designed for:
- Social media content moderation
- Academic research on bot detection
- Content analysis and verification
## Ethical Considerations
- This model should be used responsibly and not for harassment
- Results should be interpreted with appropriate confidence thresholds
- Human oversight is recommended for critical decisions
- Regular model updates may be needed as bot techniques evolve
## Citation
```bibtex
@misc{distilroberta-bot-detection-2025,
  title={Bot Detection Model using DistilRoBERTa},
  author={Junaid Ahmed and Dariusz Jemielniak and Leon Ciechanowski},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
}
```
## License
MIT License
---
**Model Card Created**: 2025-08-23
**Framework**: PyTorch + Transformers
**Validation**: 5-Fold Cross-Validation with Class Weighting