SubRoBERTa: Reddit Subreddit Classification Model

This model is RoBERTa-base fine-tuned to classify text into one of 10 subreddits. It was trained on posts from those subreddits to predict which subreddit a given text belongs to.

Model Description

  • Model type: RoBERTa-base fine-tuned for sequence classification
  • Model size: ~125M parameters (FP32, safetensors)
  • Language: English
  • License: MIT
  • Finetuned from model: roberta-base

Intended Uses & Limitations

This model is intended to be used for:

  • Classifying text into one of the following subreddits:
    • r/aitah
    • r/buildapc
    • r/dating_advice
    • r/legaladvice
    • r/minecraft
    • r/nostupidquestions
    • r/pcmasterrace
    • r/relationship_advice
    • r/techsupport
    • r/teenagers
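
The exact id-to-label mapping is stored in the model configuration and can be inspected directly; a quick check (the label strings in the config may be formatted slightly differently from the names above):

from transformers import AutoConfig

# Print the id-to-label mapping shipped with the model
config = AutoConfig.from_pretrained("marcoallanda/SubRoBERTa")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)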

Limitations

  • The model was trained on English text only
  • Performance may vary for texts that are significantly different from the training data
  • The model may not perform well on texts that don't clearly belong to any of the target subreddits (see the thresholding sketch below)
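
For inputs that fall outside these ten communities, one common mitigation is to reject low-confidence predictions. A minimal sketch, with a purely illustrative threshold of 0.5 (the threshold is not part of the released model and would need tuning):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify_with_threshold(text, threshold=0.5):
    # threshold is an illustrative value, not tuned for this model
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)
    confidence, pred_id = probs.max(dim=-1)
    if confidence.item() < threshold:
        return "unknown", confidence.item()
    return model.config.id2label[pred_id.item()], confidence.item()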

Usage

Here's how to use the model:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=-1)
    pred_id = torch.argmax(probs, dim=-1).item()
    pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
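
Continuing from the snippet above (same imports, tokenizer, and model), several posts can be classified in one batch; the example texts are illustrative:

# Classify a batch of texts at once
texts = [
    "My computer won't turn on, what should I do?",
    "Can my landlord keep my security deposit for normal wear and tear?",
]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)
for text, pred_id in zip(texts, probs.argmax(dim=-1)):
    print(f"{text!r} -> {model.config.id2label[pred_id.item()]}")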

Training and Evaluation Data

The model was trained on a dataset of posts from the 10 target subreddits. The data was divided into training and evaluation sets with an 80-20 split.
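
A sketch of how such a split can be reproduced with the datasets library; the file name and column layout are assumptions, not the actual training data:

from datasets import load_dataset

# Hypothetical CSV with "text" and "label" columns
dataset = load_dataset("csv", data_files="reddit_posts.csv")["train"]
split = dataset.train_test_split(test_size=0.2, seed=42)  # 80-20 split
train_ds, eval_ds = split["train"], split["test"]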

Training Procedure

  • Training regime: Fine-tuning
  • Learning rate: 2e-5
  • Number of epochs: 10
  • Batch size: 128
  • Optimizer: AdamW
  • Mixed precision: FP16
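
A minimal sketch of a Trainer setup matching these hyperparameters (dataset preparation and tokenization omitted; this is an illustration, not the published training script):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=10)

args = TrainingArguments(
    output_dir="subroberta",
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=128,
    fp16=True,                         # mixed precision
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",  # see the compute_metrics sketch below
)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=eval_ds, compute_metrics=compute_metrics)
# trainer.train()

AdamW is the Trainer default optimizer, so it needs no explicit setting; note that older transformers versions name the evaluation argument evaluation_strategy rather than eval_strategy.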

Training Results

The model was evaluated with accuracy and macro-averaged F1 (F1-macro); the best checkpoint was selected by F1-macro score on the evaluation set.
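
A sketch of a compute_metrics function producing these two scores (assuming scikit-learn; the exact metric code used in training is not published):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }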

Citation

If you use this model in your research, please cite:

@misc{SubRoBERTa,
  author = {Marco Allanda},
  title = {SubRoBERTa: Reddit Subreddit Classification Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}