# SubRoBERTa: Reddit Subreddit Classification Model
SubRoBERTa is a RoBERTa-base model fine-tuned to classify text into one of 10 subreddits. It was trained on posts drawn from those subreddits to predict which one a given text belongs to.
## Model Description
- Model type: RoBERTa-base fine-tuned for sequence classification
- Language: English
- License: MIT
- Finetuned from model: roberta-base
## Intended Uses & Limitations
This model is intended for classifying text into one of the following subreddits:

- r/aitah
- r/buildapc
- r/dating_advice
- r/legaladvice
- r/minecraft
- r/nostupidquestions
- r/pcmasterrace
- r/relationship_advice
- r/techsupport
- r/teenagers
### Limitations

- The model was trained on English text only.
- Performance may degrade on text that differs significantly from the training data (Reddit posts).
- The model may not perform well on text that doesn't clearly belong to any of the target subreddits; a simple confidence-threshold check (sketched below) can help flag such inputs.
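One way to handle out-of-scope inputs is to reject predictions whose top softmax probability falls below a threshold. A minimal sketch; the threshold value of 0.5 is an arbitrary assumption and should be tuned on your own data:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Hypothetical rejection threshold, not part of the released model.
THRESHOLD = 0.5

def classify_with_rejection(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)
    confidence, pred_id = probs.max(dim=-1)
    if confidence.item() < THRESHOLD:
        return "out-of-scope (low confidence)"
    return model.config.id2label[pred_id.item()]

print(classify_with_rejection("My sourdough starter stopped bubbling."))
```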
## Usage
Here's how to use the model:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

probs = F.softmax(logits, dim=-1)
pred_id = torch.argmax(probs, dim=-1).item()
pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
```
## Training and Evaluation Data
The model was trained on posts from the 10 target subreddits. The data was divided into training and evaluation sets using an 80/20 split.
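The card doesn't specify how the split was produced. A minimal sketch using the `datasets` library; the data file name and the stratification by label are assumptions:

```python
from datasets import load_dataset

# Hypothetical data file; the card does not name the source dataset.
dataset = load_dataset("csv", data_files="reddit_posts.csv")["train"]

# Encode the subreddit column as a ClassLabel so the split can stratify on it
# (stratification is an assumption, not stated in the card).
dataset = dataset.class_encode_column("label")

# 80/20 train/eval split.
splits = dataset.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
train_ds, eval_ds = splits["train"], splits["test"]
```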
## Training Procedure
- Training regime: Fine-tuning
- Learning rate: 2e-5
- Number of epochs: 10
- Batch size: 128
- Optimizer: AdamW
- Mixed precision: FP16
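These hyperparameters map naturally onto Hugging Face `TrainingArguments`. A sketch assuming the `Trainer` API was used (the card does not say); AdamW is the `Trainer` default optimizer, so it needs no explicit setting:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="subroberta",
    learning_rate=2e-5,
    num_train_epochs=10,
    # The card lists batch size 128; per-device vs. total is not specified.
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    fp16=True,                         # mixed precision
    eval_strategy="epoch",             # transformers >= 4.41 (older: evaluation_strategy)
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",  # best checkpoint by F1-macro (see below)
)
```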
### Training Results
The model was evaluated using accuracy and F1-macro. The best checkpoint was selected based on the F1-macro score.
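A sketch of a corresponding metric function for the `Trainer`, assuming scikit-learn was used to compute the metrics (the card does not say):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }
```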
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{SubRoBERTa,
  author       = {Marco Allanda},
  title        = {SubRoBERTa: Reddit Subreddit Classification Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}
```