---
language: en
license: mit
tags:
- text-classification
- bot-detection
- social-media
- distilroberta
- pytorch
- transformers
datasets:
- custom
widget:
- text: "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
  example_title: "Promotional Bot Text"
- text: "Just finished reading an interesting article about machine learning applications in healthcare."
  example_title: "Human-like Text"
- text: "Follow for follow? Like my posts and I'll like yours back! 💯"
  example_title: "Social Media Bot"
- text: "Had a wonderful dinner with my family tonight. These moments are precious."
  example_title: "Authentic Human Text"
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: distilroberta-bot-detection
  results:
  - task:
      type: text-classification
      name: Bot Detection
    metrics:
    - type: accuracy
      value: 0.9423
      name: Test Accuracy
    - type: f1
      value: 0.9424
      name: Test F1-Score (Weighted)
    - type: precision
      value: 0.9428
      name: Test Precision (Weighted)
    - type: recall
      value: 0.9423
      name: Test Recall (Weighted)
---

# Bot Detection Model - DistilRoBERTa

## Model Description

This model fine-tunes DistilRoBERTa-base for binary classification of social media text, distinguishing human-authored from bot-generated content. Training used a class-weighted loss to handle dataset imbalance, and performance was validated with 5-fold cross-validation.
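
For a quick smoke test, the model also loads through the `pipeline` API. Note that this skips the text cleaning applied at training time (shown in the Quick Start below), and the raw label strings it returns depend on the hosted config; reading them as human/bot assumes the id convention listed under Training Details (0 = human, 1 = bot).

```python
from transformers import pipeline

# Minimal smoke test. The returned label strings (e.g. "LABEL_0"/"LABEL_1")
# come from the hosted config; 0 = human, 1 = bot is assumed here.
detector = pipeline("text-classification", model="junaid1993/distilroberta-bot-detection")
print(detector("Follow for follow? Like my posts and I'll like yours back!"))
```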

## Performance

### Cross-Validation Results (5-Fold)

| Metric | Mean ± Std | Range |
|--------|------------|-------|
| **Accuracy** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
| **F1-Score (Weighted)** | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
| **Precision (Weighted)** | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |

### Test Set Performance

- **Accuracy**: 0.9423
- **F1-Score (Weighted)**: 0.9424
- **Precision (Weighted)**: 0.9428
- **Recall (Weighted)**: 0.9423
- **Inference Speed**: 232.83 samples/second
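
The weighted figures follow the standard scikit-learn definitions; a minimal sketch with placeholder labels and predictions (substitute real model outputs to reproduce the numbers above):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder gold labels and predictions (0 = human, 1 = bot).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```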

## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

# Load model and tokenizer
model_name = "junaid1993/distilroberta-bot-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for deterministic inference


def preprocess_text(text):
    """Clean text for bot detection."""
    if not isinstance(text, str):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove @ and # symbols
    text = re.sub(r'[@#]', '', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()


def predict_bot(text, threshold=0.5):
    """Predict whether text is bot-generated."""
    clean_text = preprocess_text(text)
    if not clean_text:
        # Nothing left after cleaning: return a neutral result.
        return {"prediction": "unknown", "bot_probability": 0.5, "human_probability": 0.5}

    inputs = tokenizer(
        clean_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)

    bot_prob = probabilities[0][1].item()
    prediction = "bot" if bot_prob > threshold else "human"

    return {
        "prediction": prediction,
        "bot_probability": round(bot_prob, 4),
        "human_probability": round(probabilities[0][0].item(), 4)
    }


# Example usage
text = "🔥 AMAZING DEAL! Click here now!"
result = predict_bot(text)
print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
```
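
For throughput closer to the figure quoted above, tokenize many texts in one call instead of looping; a sketch reusing `tokenizer`, `model`, and `preprocess_text` from the Quick Start:

```python
# Batched inference sketch (assumes the Quick Start objects are in scope).
texts = [
    "Follow for follow? Like my posts and I'll like yours back!",
    "Had a wonderful dinner with my family tonight.",
]
cleaned = [preprocess_text(t) for t in texts]
batch = tokenizer(cleaned, return_tensors="pt", truncation=True,
                  padding=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
for text, p in zip(texts, probs):
    print(f"bot_probability={p[1].item():.4f}  {text}")
```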

## Training Details

### Model Architecture

- **Base Model**: distilroberta-base
- **Task**: Binary sequence classification
- **Classes**: Human (0) vs Bot (1)
- **Parameters**: ~82M

### Training Configuration

- **Epochs**: 10 (with early stopping)
- **Batch Size**: 2 per device, with gradient accumulation over 8 steps (effective batch size 16)
- **Learning Rate**: Trainer default (AdamW optimizer)
- **Weight Decay**: 0.01
- **Mixed Precision**: FP16
- **Class Weighting**: Applied to handle dataset imbalance (see the sketch below)
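
A minimal sketch of this configuration with a class-weighted loss; the weight values, output path, and callback details are illustrative assumptions, not the published training script:

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import Trainer, TrainingArguments

class WeightedTrainer(Trainer):
    """Trainer variant that applies per-class weights to the cross-entropy loss."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Illustrative weights, e.g. torch.tensor([0.8, 1.3]); the real values
        # would come from the training-set label distribution.
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

training_args = TrainingArguments(
    output_dir="distilroberta-bot-detection",  # illustrative path
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",        # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,  # pairs with an EarlyStoppingCallback
)
```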

### Data Preprocessing

1. URL removal
2. Special character cleaning (@ symbols, hashtags)
3. Punctuation removal
4. Number removal
5. Whitespace normalization
6. Lowercase conversion
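
These are the same steps implemented by `preprocess_text` in the Quick Start. For example, on the first widget text (note that a bare `bit.ly/...` link survives step 1, since the URL regex only matches `http`/`www.` prefixes):

```python
raw = "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
print(preprocess_text(raw))  # uses preprocess_text from the Quick Start
# -> amazing deal get off now limited time only click here bitlydeal
```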

### Validation Methodology

- **Cross-Validation**: 5-fold stratified K-fold
- **Test Split**: 20% holdout set
- **Metrics**: Accuracy, Precision, Recall, F1-score (both weighted and macro)
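
A sketch of this split protocol; the corpus, seed, and fold assignments below are placeholders, not the actual experimental setup:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Placeholder imbalanced corpus (70 human / 30 bot) standing in for the real data.
texts = np.array([f"sample text {i}" for i in range(100)])
labels = np.array([0] * 70 + [1] * 30)

# 20% stratified holdout kept aside as the final test set.
X_dev, X_test, y_dev, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# 5-fold stratified K-fold over the remaining 80%.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_dev, y_dev)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```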

## Limitations

- **Domain**: Primarily trained on social media text patterns
- **Language**: English text only
- **Temporal**: Bot patterns may evolve over time, requiring retraining
- **Context**: Performance may vary with text length and complexity

## Intended Use

This model is designed for:

- Social media content moderation
- Academic research on bot detection
- Content analysis and verification

## Ethical Considerations

- This model should be used responsibly and not for harassment
- Results should be interpreted with appropriate confidence thresholds
- Human oversight is recommended for critical decisions
- Regular model updates may be needed as bot techniques evolve

## Citation

```bibtex
@misc{distilroberta-bot-detection-2025,
  title={Bot Detection Model using DistilRoBERTa},
  author={Junaid Ahmed and Dariusz Jemielniak and Leon Ciechanowski},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
}
```

## License

MIT License

---

**Model Card Created**: 2025-08-23

**Framework**: PyTorch + Transformers

**Validation**: 5-Fold Cross-Validation with Class Weighting
|