SubRoBERTa: Reddit Subreddit Classification Model

This model is RoBERTa-base fine-tuned to classify text into one of 10 subreddits. It was trained on posts from those subreddits to predict which subreddit a given text belongs to.

Model Description

  • Model type: RoBERTa-base fine-tuned for sequence classification
  • Model size: ~125M parameters (FP32, safetensors)
  • Language: English
  • License: MIT
  • Finetuned from model: roberta-base

Intended Uses & Limitations

This model is intended to be used for:

  • Classifying text into one of the following subreddits:
    • r/aitah
    • r/buildapc
    • r/dating_advice
    • r/legaladvice
    • r/minecraft
    • r/nostupidquestions
    • r/pcmasterrace
    • r/relationship_advice
    • r/techsupport
    • r/teenagers
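
The exact id-to-label mapping is stored in the model configuration and can be inspected directly; a quick check (the label strings in the config may be formatted slightly differently from the names above):

from transformers import AutoConfig

# Print the id-to-label mapping shipped with the model
config = AutoConfig.from_pretrained("marcoallanda/SubRoBERTa")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)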

Limitations

  • The model was trained on English text only
  • Performance may vary for texts that are significantly different from the training data
  • The model may not perform well on texts that don't clearly belong to any of the target subreddits (see the thresholding sketch below)
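
For inputs that fall outside these ten communities, one common mitigation is to reject low-confidence predictions. A minimal sketch, with a purely illustrative threshold of 0.5 (the threshold is not part of the released model and would need tuning):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify_with_threshold(text, threshold=0.5):
    # threshold is an illustrative value, not tuned for this model
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)
    confidence, pred_id = probs.max(dim=-1)
    if confidence.item() < threshold:
        return "unknown", confidence.item()
    return model.config.id2label[pred_id.item()], confidence.item()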

Usage

Here's how to use the model:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=-1)
    pred_id = torch.argmax(probs, dim=-1).item()
    pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
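
Continuing from the snippet above (same imports, tokenizer, and model), several posts can be classified in one batch; the example texts are illustrative:

# Classify a batch of texts at once
texts = [
    "My computer won't turn on, what should I do?",
    "Can my landlord keep my security deposit for normal wear and tear?",
]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)
for text, pred_id in zip(texts, probs.argmax(dim=-1)):
    print(f"{text!r} -> {model.config.id2label[pred_id.item()]}")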

Training and Evaluation Data

The model was trained on a dataset of posts from the 10 target subreddits. The data was divided into training and evaluation sets with an 80-20 split.
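
A sketch of how such a split can be reproduced with the datasets library; the file name and column layout are assumptions, not the actual training data:

from datasets import load_dataset

# Hypothetical CSV with "text" and "label" columns
dataset = load_dataset("csv", data_files="reddit_posts.csv")["train"]
split = dataset.train_test_split(test_size=0.2, seed=42)  # 80-20 split
train_ds, eval_ds = split["train"], split["test"]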

Training Procedure

  • Training regime: Fine-tuning
  • Learning rate: 2e-5
  • Number of epochs: 10
  • Batch size: 128
  • Optimizer: AdamW
  • Mixed precision: FP16
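
A minimal sketch of a Trainer setup matching these hyperparameters (dataset preparation and tokenization omitted; this is an illustration, not the published training script):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=10)

args = TrainingArguments(
    output_dir="subroberta",
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=128,
    fp16=True,                         # mixed precision
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",  # see the compute_metrics sketch below
)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=eval_ds, compute_metrics=compute_metrics)
# trainer.train()

AdamW is the Trainer default optimizer, so it needs no explicit setting; note that older transformers versions name the evaluation argument evaluation_strategy rather than eval_strategy.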

Training Results

The model was evaluated with accuracy and macro-averaged F1 (F1-macro); the best checkpoint was selected by F1-macro score on the evaluation set.
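
A sketch of a compute_metrics function producing these two scores (assuming scikit-learn; the exact metric code used in training is not published):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }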

Citation

If you use this model in your research, please cite:

@misc{SubRoBERTa,
  author = {Marco Allanda},
  title = {SubRoBERTa: Reddit Subreddit Classification Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}