---
language: en
license: mit
tags:
- text-classification
- bot-detection
- social-media
- distilroberta
- pytorch
- transformers
datasets:
- custom
widget:
- text: "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
  example_title: "Promotional Bot Text"
- text: "Just finished reading an interesting article about machine learning applications in healthcare."
  example_title: "Human-like Text"
- text: "Follow for follow? Like my posts and I'll like yours back! 💯"
  example_title: "Social Media Bot"
- text: "Had a wonderful dinner with my family tonight. These moments are precious."
  example_title: "Authentic Human Text"
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: distilroberta-bot-detection
  results:
  - task:
      type: text-classification
      name: Bot Detection
    metrics:
    - type: accuracy
      value: 0.9423
      name: Test Accuracy
    - type: f1
      value: 0.9424
      name: Test F1-Score (Weighted)
    - type: precision
      value: 0.9428
      name: Test Precision (Weighted)
    - type: recall
      value: 0.9423
      name: Test Recall (Weighted)
---

# Bot Detection Model - DistilRoBERTa

## Model Description

This model fine-tunes DistilRoBERTa-base for binary classification of social media text, distinguishing human-authored from bot-generated content. Training used a class-weighted loss to handle dataset imbalance, and performance was validated with 5-fold cross-validation.
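
For a quick smoke test, the model also loads through the `pipeline` API. Note that this skips the text cleaning applied at training time (shown in the Quick Start below), and the raw label strings it returns depend on the hosted config; reading them as human/bot assumes the id convention listed under Training Details (0 = human, 1 = bot).

```python
from transformers import pipeline

# Minimal smoke test. The returned label strings (e.g. "LABEL_0"/"LABEL_1")
# come from the hosted config; 0 = human, 1 = bot is assumed here.
detector = pipeline("text-classification", model="junaid1993/distilroberta-bot-detection")
print(detector("Follow for follow? Like my posts and I'll like yours back!"))
```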

## Performance

### Cross-Validation Results (5-Fold)

| Metric | Mean ± Std | Range |
|--------|------------|-------|
| **Accuracy** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
| **F1-Score (Weighted)** | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
| **Precision (Weighted)** | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |

### Test Set Performance

- **Accuracy**: 0.9423
- **F1-Score (Weighted)**: 0.9424
- **Precision (Weighted)**: 0.9428
- **Recall (Weighted)**: 0.9423
- **Inference Speed**: 232.83 samples/second
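
The weighted figures follow the standard scikit-learn definitions; a minimal sketch with placeholder labels and predictions (substitute real model outputs to reproduce the numbers above):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder gold labels and predictions (0 = human, 1 = bot).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```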

## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

# Load model and tokenizer
model_name = "junaid1993/distilroberta-bot-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for deterministic inference


def preprocess_text(text):
    """Clean text for bot detection."""
    if not isinstance(text, str):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove @ and # symbols
    text = re.sub(r'[@#]', '', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()


def predict_bot(text, threshold=0.5):
    """Predict whether text is bot-generated."""
    clean_text = preprocess_text(text)
    if not clean_text:
        # Nothing left after cleaning: return a neutral result.
        return {"prediction": "unknown", "bot_probability": 0.5, "human_probability": 0.5}

    inputs = tokenizer(
        clean_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)

    bot_prob = probabilities[0][1].item()
    prediction = "bot" if bot_prob > threshold else "human"

    return {
        "prediction": prediction,
        "bot_probability": round(bot_prob, 4),
        "human_probability": round(probabilities[0][0].item(), 4)
    }


# Example usage
text = "🔥 AMAZING DEAL! Click here now!"
result = predict_bot(text)
print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
```
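
For throughput closer to the figure quoted above, tokenize many texts in one call instead of looping; a sketch reusing `tokenizer`, `model`, and `preprocess_text` from the Quick Start:

```python
# Batched inference sketch (assumes the Quick Start objects are in scope).
texts = [
    "Follow for follow? Like my posts and I'll like yours back!",
    "Had a wonderful dinner with my family tonight.",
]
cleaned = [preprocess_text(t) for t in texts]
batch = tokenizer(cleaned, return_tensors="pt", truncation=True,
                  padding=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
for text, p in zip(texts, probs):
    print(f"bot_probability={p[1].item():.4f}  {text}")
```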

## Training Details

### Model Architecture

- **Base Model**: distilroberta-base
- **Task**: Binary sequence classification
- **Classes**: Human (0) vs Bot (1)
- **Parameters**: ~82M

### Training Configuration

- **Epochs**: 10 (with early stopping)
- **Batch Size**: 2 per device, with gradient accumulation over 8 steps (effective batch size 16)
- **Learning Rate**: Trainer default (AdamW optimizer)
- **Weight Decay**: 0.01
- **Mixed Precision**: FP16
- **Class Weighting**: Applied to handle dataset imbalance (see the sketch below)
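
A minimal sketch of this configuration with a class-weighted loss; the weight values, output path, and callback details are illustrative assumptions, not the published training script:

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import Trainer, TrainingArguments

class WeightedTrainer(Trainer):
    """Trainer variant that applies per-class weights to the cross-entropy loss."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Illustrative weights, e.g. torch.tensor([0.8, 1.3]); the real values
        # would come from the training-set label distribution.
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

training_args = TrainingArguments(
    output_dir="distilroberta-bot-detection",  # illustrative path
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",        # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,  # pairs with an EarlyStoppingCallback
)
```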

### Data Preprocessing

1. URL removal
2. Special character cleaning (@ symbols, hashtags)
3. Punctuation removal
4. Number removal
5. Whitespace normalization
6. Lowercase conversion
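
These are the same steps implemented by `preprocess_text` in the Quick Start. For example, on the first widget text (note that a bare `bit.ly/...` link survives step 1, since the URL regex only matches `http`/`www.` prefixes):

```python
raw = "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
print(preprocess_text(raw))  # uses preprocess_text from the Quick Start
# -> amazing deal get off now limited time only click here bitlydeal
```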

### Validation Methodology

- **Cross-Validation**: 5-fold stratified K-fold
- **Test Split**: 20% holdout set
- **Metrics**: Accuracy, Precision, Recall, F1-score (both weighted and macro)
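
A sketch of this split protocol; the corpus, seed, and fold assignments below are placeholders, not the actual experimental setup:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Placeholder imbalanced corpus (70 human / 30 bot) standing in for the real data.
texts = np.array([f"sample text {i}" for i in range(100)])
labels = np.array([0] * 70 + [1] * 30)

# 20% stratified holdout kept aside as the final test set.
X_dev, X_test, y_dev, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 5-fold stratified K-fold over the remaining 80%.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_dev, y_dev)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```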

## Limitations

- **Domain**: Primarily trained on social media text patterns
- **Language**: English text only
- **Temporal**: Bot patterns may evolve over time, requiring retraining
- **Context**: Performance may vary with text length and complexity

## Intended Use

This model is designed for:

- Social media content moderation
- Academic research on bot detection
- Content analysis and verification

## Ethical Considerations

- This model should be used responsibly and not for harassment
- Results should be interpreted with appropriate confidence thresholds
- Human oversight is recommended for critical decisions
- Regular model updates may be needed as bot techniques evolve

## Citation

```bibtex
@misc{distilroberta-bot-detection-2025,
  title={Bot Detection Model using DistilRoBERTa},
  author={Junaid Ahmed and Dariusz Jemielniak and Leon Ciechanowski},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
}
```

## License

MIT License

---

**Model Card Created**: 2025-08-23

**Framework**: PyTorch + Transformers

**Validation**: 5-Fold Cross-Validation with Class Weighting
|