---
language: en
license: mit
tags:
- text-classification
- bot-detection
- social-media
- distilroberta
- pytorch
- transformers
datasets:
- custom
widget:
- text: "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
  example_title: "Promotional Bot Text"
- text: "Just finished reading an interesting article about machine learning applications in healthcare."
  example_title: "Human-like Text"
- text: "Follow for follow? Like my posts and I'll like yours back! 💯"
  example_title: "Social Media Bot"
- text: "Had a wonderful dinner with my family tonight. These moments are precious."
  example_title: "Authentic Human Text"
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: distilroberta-bot-detection
  results:
  - task:
      type: text-classification
      name: Bot Detection
    metrics:
    - type: accuracy
      value: 0.9423
      name: Test Accuracy
    - type: f1
      value: 0.9424
      name: Test F1-Score (Weighted)
    - type: precision
      value: 0.9428
      name: Test Precision (Weighted)
    - type: recall
      value: 0.9423
      name: Test Recall (Weighted)
---
# Bot Detection Model - DistilRoBERTa
## Model Description
This model is a DistilRoBERTa-base checkpoint fine-tuned for binary classification of social media text, distinguishing human-authored from bot-generated content. Training uses a class-weighted loss to handle dataset imbalance, and performance was validated with 5-fold cross-validation.
## Performance
### Cross-Validation Results (5-Fold)
| Metric | Mean ± Std | Range |
|--------|------------|-------|
| **Accuracy** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
| **F1-Score (Weighted)** | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
| **Precision (Weighted)** | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |
### Test Set Performance
- **Accuracy**: 0.9423
- **F1-Score (Weighted)**: 0.9424
- **Precision (Weighted)**: 0.9428
- **Recall (Weighted)**: 0.9423
- **Inference Speed**: 232.83 samples/second
## Usage
### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

# Load model and tokenizer
model_name = "junaid1993/distilroberta-bot-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def preprocess_text(text):
    """Clean text for bot detection"""
    if not isinstance(text, str):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove @ and # symbols
    text = re.sub(r'[@#]', '', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Clean whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

def predict_bot(text, threshold=0.5):
    """Predict if text is bot-generated"""
    clean_text = preprocess_text(text)
    if not clean_text:
        return {"prediction": "unknown", "confidence": 0.5}
    inputs = tokenizer(
        clean_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    bot_prob = probabilities[0][1].item()
    prediction = "bot" if bot_prob > threshold else "human"
    return {
        "prediction": prediction,
        "bot_probability": round(bot_prob, 4),
        "human_probability": round(probabilities[0][0].item(), 4)
    }

# Example usage
text = "🔥 AMAZING DEAL! Click here now!"
result = predict_bot(text)
print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
```
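The `threshold` parameter controls the precision/recall trade-off: raising it above 0.5 makes the `bot` label harder to earn, which is useful when false accusations are costly. The decision step is just a softmax over two logits followed by the threshold, shown here in isolation with made-up logits (not real model outputs, so no model download is needed):

```python
import torch

# Made-up logits for one input: [human_logit, bot_logit] — illustrative only
logits = torch.tensor([[0.3, 2.1]])
probs = torch.nn.functional.softmax(logits, dim=-1)
bot_prob = probs[0][1].item()  # roughly 0.86 for these logits

# A stricter threshold flips a borderline "bot" call back to "human"
for threshold in (0.5, 0.9):
    prediction = "bot" if bot_prob > threshold else "human"
    print(f"threshold={threshold}: {prediction}")
```

A higher threshold is recommended for moderation settings where human review follows only flagged content.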
## Training Details
### Model Architecture
- **Base Model**: distilroberta-base
- **Task**: Binary sequence classification
- **Classes**: Human (0) vs Bot (1)
- **Parameters**: ~82M
### Training Configuration
- **Epochs**: 10 (with early stopping)
- **Batch Size**: 2 per device with gradient accumulation over 8 steps (effective batch size 16)
- **Optimizer**: AdamW with the default learning-rate schedule
- **Weight Decay**: 0.01
- **Mixed Precision**: FP16
- **Class Weighting**: Applied to handle dataset imbalance
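Class weighting of this kind is typically implemented by passing per-class weights into the cross-entropy loss, so errors on the minority class cost more. A minimal sketch with illustrative weights (not the values used to train this model):

```python
import torch

# Illustrative per-class weights for human (0) vs bot (1);
# NOT the actual weights used in training.
class_weights = torch.tensor([0.7, 1.3])
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

# Fake logits for a batch of two examples and their true labels
logits = torch.tensor([[2.0, 0.5], [0.2, 1.8]])
labels = torch.tensor([0, 1])
loss = loss_fn(logits, labels)
print(f"weighted loss: {loss.item():.4f}")
```

With a Hugging Face `Trainer`, this is usually wired in by overriding `compute_loss` to use the weighted criterion.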
### Data Preprocessing
1. URL removal
2. Special character cleaning (@ symbols, hashtags)
3. Punctuation removal
4. Number removal
5. Whitespace normalization
6. Lowercase conversion
### Validation Methodology
- **Cross-Validation**: 5-fold Stratified K-Fold
- **Test Split**: 20% holdout set
- **Metrics**: Accuracy, Precision, Recall, F1-score (both weighted and macro)
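A stratified split keeps the human/bot class ratio constant across folds, which matters for an imbalanced dataset. The scheme can be sketched with scikit-learn's `StratifiedKFold` on synthetic labels (the real dataset is not bundled with this card):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 40 human (0), 10 bot (1)
y = np.array([0] * 40 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Stratification keeps the 4:1 class ratio in every validation fold
    print(f"fold {fold}: val bots = {int(y[val_idx].sum())} / {len(val_idx)}")
```

Every validation fold above ends up with exactly 2 of the 10 bot samples, mirroring the overall class distribution.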
## Limitations
- **Domain**: Primarily trained on social media text patterns
- **Language**: English text only
- **Temporal**: Bot patterns may evolve over time, requiring retraining
- **Context**: Performance may vary with text length and complexity
## Intended Use
This model is designed for:
- Social media content moderation
- Academic research on bot detection
- Content analysis and verification
## Ethical Considerations
- This model should be used responsibly and not for harassment
- Results should be interpreted with appropriate confidence thresholds
- Human oversight is recommended for critical decisions
- Regular model updates may be needed as bot techniques evolve
## Citation
```bibtex
@misc{distilroberta-bot-detection-2025,
  title={Bot Detection Model using DistilRoBERTa},
  author={Junaid Ahmed and Dariusz Jemielniak and Leon Ciechanowski},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
}
```
## License
MIT License
---
**Model Card Created**: 2025-08-23
**Framework**: PyTorch + Transformers
**Validation**: 5-Fold Cross-Validation with Class Weighting