---
language: en
license: mit
tags:
- text-classification
- bot-detection
- social-media
- distilroberta
- pytorch
- transformers
datasets:
- custom
widget:
- text: "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
  example_title: "Promotional Bot Text"
- text: "Just finished reading an interesting article about machine learning applications in healthcare."
  example_title: "Human-like Text"
- text: "Follow for follow? Like my posts and I'll like yours back! 💯"
  example_title: "Social Media Bot"
- text: "Had a wonderful dinner with my family tonight. These moments are precious."
  example_title: "Authentic Human Text"
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: distilroberta-bot-detection
  results:
  - task:
      type: text-classification
      name: Bot Detection
    metrics:
    - type: accuracy
      value: 0.9423
      name: Test Accuracy
    - type: f1
      value: 0.9424
      name: Test F1-Score (Weighted)
    - type: precision
      value: 0.9428
      name: Test Precision (Weighted)
    - type: recall
      value: 0.9423
      name: Test Recall (Weighted)
---

# Bot Detection Model - DistilRoBERTa

## Model Description

This model is a fine-tuned version of `distilroberta-base` that classifies social media text as human-authored or bot-generated. It was trained with class weighting to compensate for class imbalance in the dataset and validated with 5-fold stratified cross-validation.

## Performance

### Cross-Validation Results (5-Fold)
| Metric | Mean ± Std | Range |
|--------|------------|-------|
| **Accuracy** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
| **F1-Score (Weighted)** | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
| **Precision (Weighted)** | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |

### Test Set Performance
- **Accuracy**: 0.9423
- **F1-Score (Weighted)**: 0.9424
- **Precision (Weighted)**: 0.9428
- **Recall (Weighted)**: 0.9423
- **Inference Speed**: 232.83 samples/second
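
The weighted metrics above average per-class scores in proportion to each class's support (its number of true examples). A minimal sketch of that calculation follows, using only the standard library; the function name `weighted_metrics` and the toy labels are illustrative assumptions, not the evaluation code used for this model.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, and F1, plus plain accuracy."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        # True positives and predicted count for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        p_c = tp / n_pred if n_pred else 0.0
        r_c = tp / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        # Each class contributes in proportion to its support
        weight = n_true / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (0 = human, 1 = bot), made up for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

Support-weighted recall always equals accuracy, because the per-class supports cancel in the weighted sum. This is why the two values reported above are identical.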

## Usage

### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

# Load model and tokenizer
model_name = "junaid1993/distilroberta-bot-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def preprocess_text(text):
    """Clean text for bot detection"""
    if not isinstance(text, str):
        return ""
    
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove @ and # symbols
    text = re.sub(r'[@#]', '', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Clean whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text.lower()

def predict_bot(text, threshold=0.5):
    """Predict if text is bot-generated"""
    clean_text = preprocess_text(text)
    
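    if not clean_text:
        # Nothing left to classify after cleaning; return the same keys as the normal path
        return {"prediction": "unknown", "bot_probability": None, "human_probability": None}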
    
    inputs = tokenizer(
        clean_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512
    )
    
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    bot_prob = probabilities[0][1].item()
    prediction = "bot" if bot_prob > threshold else "human"
    
    return {
        "prediction": prediction,
        "bot_probability": round(bot_prob, 4),
        "human_probability": round(probabilities[0][0].item(), 4)
    }

# Example usage
text = "🔥 AMAZING DEAL! Click here now!"
result = predict_bot(text)
print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
```

## Training Details

### Model Architecture
- **Base Model**: distilroberta-base
- **Task**: Binary sequence classification
- **Classes**: Human (0) vs Bot (1)
- **Parameters**: ~82M

### Training Configuration
- **Epochs**: 10 (with early stopping)
- **Batch Size**: 2 per device with 8 gradient accumulation steps (effective batch size of 16 per device)
- **Learning Rate**: Automatic (AdamW optimizer)
- **Weight Decay**: 0.01
- **Mixed Precision**: FP16
- **Class Weighting**: Applied to handle dataset imbalance
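
The card does not say which class weighting scheme was applied. As one common possibility, the sketch below computes "balanced" weights, `n_samples / (n_classes * class_count)`, so that the rarer class contributes more to the loss. The function name and the 80/20 label split are assumptions made for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical imbalanced label list: 80 human (0) and 20 bot (1) examples
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)
print(weights)
```

In PyTorch, weights like these are typically passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, ordered by class index.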

### Data Preprocessing
1. URL removal
2. Special character cleaning (@ symbols, hashtags)
3. Punctuation removal
4. Number removal
5. Whitespace normalization
6. Lowercase conversion
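
To make the effect of each step visible, the sketch below applies the six steps in order, using the same regular expressions as `preprocess_text` in the Quick Start, and records the intermediate text. The helper name `trace_preprocessing` and the sample sentences are illustrative assumptions.

```python
import re

# The six preprocessing steps, in the order listed above
STEPS = [
    ("remove URLs", lambda t: re.sub(r'http\S+|www\.\S+', '', t)),
    ("remove @ and #", lambda t: re.sub(r'[@#]', '', t)),
    ("remove punctuation", lambda t: re.sub(r'[^\w\s]', '', t)),
    ("remove numbers", lambda t: re.sub(r'\d+', '', t)),
    ("normalize whitespace", lambda t: re.sub(r'\s+', ' ', t).strip()),
    ("lowercase", lambda t: t.lower()),
]

def trace_preprocessing(text):
    """Return a list of (step name, text after that step)."""
    stages = []
    for name, step in STEPS:
        text = step(text)
        stages.append((name, text))
    return stages

sample = "Check this out!!! https://example.com @user #Deal 50% off"
for name, value in trace_preprocessing(sample):
    print(f"{name:>22}: {value!r}")

# A short link without "http" or "www." is not matched by the URL pattern
short_link = trace_preprocessing("Click here: bit.ly/deal123")[-1][1]
print(short_link)
```

Note that the URL pattern only matches text starting with `http` or `www.`, so bare short links survive as a run of letters once punctuation and digits are stripped.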

### Validation Methodology
- **Cross-Validation**: 5-fold Stratified K-Fold
- **Test Split**: 20% holdout set
- **Metrics**: Accuracy, Precision, Recall, F1-score (both weighted and macro)
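
Stratified K-fold keeps the human/bot ratio roughly equal in every fold, which matters when the classes are imbalanced. The sketch below shows the idea with the standard library only; it is not the training code for this model, and the function name, seed, and 80/20 label split are assumptions. In practice, scikit-learn's `StratifiedKFold` provides the same behavior.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical imbalanced label list: 80 human (0) and 20 bot (1) examples
labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels, k=5)
for number, fold in enumerate(folds, start=1):
    bots = sum(labels[i] for i in fold)
    print(f"fold {number}: {len(fold)} samples, {bots} bots")
```

Each fold serves once as the validation set while the remaining four are used for training.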

## Limitations

- **Domain**: Primarily trained on social media text patterns
- **Language**: English text only
- **Temporal**: Bot patterns may evolve over time, requiring retraining
- **Context**: Performance may vary with text length and complexity; inputs longer than 512 tokens are truncated

## Intended Use

This model is designed for:
- Social media content moderation
- Academic research on bot detection
- Content analysis and verification

## Ethical Considerations

- This model should be used responsibly and not for harassment
- Treat predictions as probabilities and choose a decision threshold (0.5 by default in `predict_bot`) that matches the cost of false positives in your application
- Human oversight is recommended for critical decisions
- Regular model updates may be needed as bot techniques evolve
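
One way to combine thresholds with human oversight is to leave a band of uncertain scores for manual review instead of forcing a binary label. The sketch below is a hypothetical helper, not part of the released model; the name `triage` and the cut-offs 0.3 and 0.7 are assumptions that would need tuning on validation data.

```python
def triage(bot_probability, low=0.3, high=0.7):
    """Map a bot probability to 'human', 'bot', or 'needs_review'."""
    if not 0.0 <= bot_probability <= 1.0:
        raise ValueError("bot_probability must be between 0 and 1")
    if low > high:
        raise ValueError("low must not exceed high")
    if bot_probability >= high:
        return "bot"
    if bot_probability <= low:
        return "human"
    # Scores between the two cut-offs are routed to a human reviewer
    return "needs_review"

for score in (0.05, 0.5, 0.95):
    print(score, triage(score))
```

The input would be the `bot_probability` value returned by `predict_bot` in the Quick Start.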

## Citation

```bibtex
@misc{distilroberta-bot-detection-2025,
  title={Bot Detection Model using DistilRoBERTa},
  author={Junaid Ahmed and Dariusz Jemielniak and Leon Ciechanowski},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
}
```

## License

MIT License

---

**Model Card Created**: 2025-08-23  
**Framework**: PyTorch + Transformers  
**Validation**: 5-Fold Cross-Validation with Class Weighting