roberta-base_topic_classification_nyt_news

This model is a fine-tuned version of roberta-base on the NYT News dataset, which contains 256,000 news titles from articles published from 2000 to the present (https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present). It achieves the following results on the test set of 51200 cases:

Accuracy: 0.91
F1: 0.91
Precision: 0.91
Recall: 0.91

Training data

Training data was classified as follow:

class	Description
0	Sports
1	Arts, Culture, and Entertainment
2	Business and Finance
3	Health and Wellness
4	Lifestyle and Fashion
5	Science and Technology
6	Politics
7	Crime

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 5

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy	F1	Precision	Recall
0.3192	1.0	20480	0.4078	0.8865	0.8859	0.8892	0.8865
0.2863	2.0	40960	0.4271	0.8972	0.8970	0.8982	0.8972
0.1979	3.0	61440	0.3797	0.9094	0.9092	0.9098	0.9094
0.1239	4.0	81920	0.3981	0.9117	0.9113	0.9114	0.9117
0.1472	5.0	102400	0.4033	0.9137	0.9135	0.9134	0.9137

Model performance

-	precision	recall	f1	support
Sports	0.97	0.98	0.97	6400
Arts, Culture, and Entertainment	0.94	0.95	0.94	6400
Business and Finance	0.85	0.84	0.84	6400
Health and Wellness	0.90	0.93	0.91	6400
Lifestyle and Fashion	0.95	0.95	0.95	6400
Science and Technology	0.89	0.83	0.86	6400
Politics	0.93	0.88	0.90	6400
Crime	0.85	0.93	0.89	6400

accuracy			0.91	51200
macro avg	0.91	0.91	0.91	51200
weighted avg	0.91	0.91	0.91	51200

How to use roberta-base_topic_classification_nyt_news with HuggingFace

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing."
pipe(text)

[{'label': 'Sports', 'score': 0.9989326596260071}]

Framework versions

Transformers 4.32.1
Pytorch 2.1.0+cu121
Datasets 2.12.0
Tokenizers 0.13.2

dstefa
/

roberta-base_topic_classification_nyt_news