roberta-base_topic_classification_nyt_news

This model is a fine-tuned version of roberta-base trained on the NYT News dataset, which contains 256,000 news titles from articles published from 2000 to the present (https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present). It achieves the following results on the test set of 51,200 examples:

  • Accuracy: 0.91
  • F1: 0.91
  • Precision: 0.91
  • Recall: 0.91

Training data

The training data was labeled with the following classes:

Class   Description
0       Sports
1       Arts, Culture, and Entertainment
2       Business and Finance
3       Health and Wellness
4       Lifestyle and Fashion
5       Science and Technology
6       Politics
7       Crime
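
For downstream code it can help to have this mapping available as a dictionary. The sketch below only transcribes the table above; the names ID2LABEL and LABEL2ID are illustrative, and the configuration shipped with the model should be treated as the authoritative source for label ids.

```python
# Class ids and descriptions transcribed from the table above.
ID2LABEL = {
    0: "Sports",
    1: "Arts, Culture, and Entertainment",
    2: "Business and Finance",
    3: "Health and Wellness",
    4: "Lifestyle and Fashion",
    5: "Science and Technology",
    6: "Politics",
    7: "Crime",
}

# Reverse lookup: description -> class id.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(ID2LABEL[0])        # Sports
print(LABEL2ID["Crime"])  # 7
```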

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 5
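
As a rough sketch, the values above could be passed to transformers.TrainingArguments as shown below. Mapping train_batch_size and eval_batch_size to the per-device arguments assumes single-device training, which the card does not state, and the output_dir value is a hypothetical path.

```python
# Hyperparameters from the list above, keyed by TrainingArguments argument names.
# Assumption: the reported batch sizes are per-device (single-device training).
TRAINING_KWARGS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "num_train_epochs": 5,
}


def build_training_args(output_dir="./results"):
    """Build TrainingArguments from the values above (output_dir is hypothetical)."""
    # Imported lazily so the dictionary can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```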

Training results

Training Loss   Epoch   Step     Validation Loss   Accuracy   F1       Precision   Recall
0.3192          1.0     20480    0.4078            0.8865     0.8859   0.8892      0.8865
0.2863          2.0     40960    0.4271            0.8972     0.8970   0.8982      0.8972
0.1979          3.0     61440    0.3797            0.9094     0.9092   0.9098      0.9094
0.1239          4.0     81920    0.3981            0.9117     0.9113   0.9114      0.9117
0.1472          5.0     102400   0.4033            0.9137     0.9135   0.9134      0.9137
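
The step counts can be cross-checked against the batch size and the dataset sizes reported earlier. The arithmetic below assumes a single device and no gradient accumulation, neither of which the card states. The remainder it computes would be consistent with a held-out validation split, but that is an inference, not something the card confirms.

```python
train_batch_size = 8      # from the hyperparameter list
steps_per_epoch = 20480   # Step value at epoch 1.0
num_epochs = 5
total_titles = 256_000    # dataset size stated in the introduction
test_size = 51_200        # test set size stated in the introduction

# Approximate number of training examples seen per epoch
# (the final batch of an epoch may be partial).
examples_per_epoch = steps_per_epoch * train_batch_size
total_steps = steps_per_epoch * num_epochs
remaining = total_titles - test_size - examples_per_epoch

print(examples_per_epoch)  # 163840
print(total_steps)         # 102400, matching the Step value at epoch 5.0
print(remaining)           # 40960
```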

Model performance

Class                              precision   recall   f1     support
Sports                             0.97        0.98     0.97   6400
Arts, Culture, and Entertainment   0.94        0.95     0.94   6400
Business and Finance               0.85        0.84     0.84   6400
Health and Wellness                0.90        0.93     0.91   6400
Lifestyle and Fashion              0.95        0.95     0.95   6400
Science and Technology             0.89        0.83     0.86   6400
Politics                           0.93        0.88     0.90   6400
Crime                              0.85        0.93     0.89   6400
accuracy                                                0.91   51200
macro avg                          0.91        0.91     0.91   51200
weighted avg                       0.91        0.91     0.91   51200
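
The macro averages can be reproduced from the per-class rows as an unweighted mean. Because every class has the same support (6400), the weighted average coincides with the macro average here. The helper name macro_average is illustrative.

```python
# Per-class (precision, recall, f1) transcribed from the report above.
report = {
    "Sports": (0.97, 0.98, 0.97),
    "Arts, Culture, and Entertainment": (0.94, 0.95, 0.94),
    "Business and Finance": (0.85, 0.84, 0.84),
    "Health and Wellness": (0.90, 0.93, 0.91),
    "Lifestyle and Fashion": (0.95, 0.95, 0.95),
    "Science and Technology": (0.89, 0.83, 0.86),
    "Politics": (0.93, 0.88, 0.90),
    "Crime": (0.85, 0.93, 0.89),
}


def macro_average(rows, column):
    """Unweighted mean of one metric column across all classes."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)


macro_precision = macro_average(report, 0)
macro_recall = macro_average(report, 1)
macro_f1 = macro_average(report, 2)

print(f"{macro_precision:.2f} {macro_recall:.2f} {macro_f1:.2f}")  # 0.91 0.91 0.91
```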

How to use roberta-base_topic_classification_nyt_news with Hugging Face

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
model_id = "dstefa/roberta-base_topic_classification_nyt_news"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a single news title.
text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing."
print(pipe(text))

# Output:
# [{'label': 'Sports', 'score': 0.9989326596260071}]
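
The pipeline returns a list of dictionaries with label and score keys, so results can be post-processed with plain Python. As a minimal sketch, the helper below keeps only predictions above a confidence threshold; the function name keep_confident and the 0.9 threshold are hypothetical choices, not part of the model card.

```python
def keep_confident(predictions, threshold=0.9):
    """Return the predictions whose score is at least `threshold`."""
    return [p for p in predictions if p["score"] >= threshold]


# Pipeline output copied from the example above.
predictions = [{"label": "Sports", "score": 0.9989326596260071}]

confident = keep_confident(predictions)
print([p["label"] for p in confident])  # ['Sports']
```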

Framework versions

  • Transformers 4.32.1
  • PyTorch 2.1.0+cu121
  • Datasets 2.12.0
  • Tokenizers 0.13.2