Dataset Card for Custom Text Dataset

Dataset Name

Custom Text Summarization Dataset (CNN/DailyMail Subset)

Overview

This dataset is a subset of the CNN/DailyMail news dataset, a widely used benchmark for training and evaluating text summarization models. Each example pairs a news article with a human-written summary.

Number of examples (full dataset): 287,113 (training set), 13,368 (validation set), 11,490 (test set). The 1% training subset loaded in How to Use contains roughly 2,871 examples.
Languages: English

Composition

Source: CNN and DailyMail news articles
Size: 1% subset of the full dataset
Text Fields: Each example consists of:
- article: The news article text
- highlights: The human-generated summary of the article
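
The full split sizes given in the Overview and the 1% subset size can be related with a quick calculation. The sketch below assumes the same 1% is taken from every split and uses ordinary rounding; the name subset_size is illustrative, and the loader's own rounding may differ by an example.

```python
# Full split sizes as listed in the Overview section of this card.
FULL_SPLIT_SIZES = {"train": 287_113, "validation": 13_368, "test": 11_490}


def subset_size(num_examples: int, percent: float) -> int:
    """Approximate number of examples in a `percent`% slice of a split."""
    return round(num_examples * percent / 100)


for split, total in FULL_SPLIT_SIZES.items():
    # For example, the 1% training slice comes to about 2871 examples.
    print(f"{split}: ~{subset_size(total, 1)} of {total}")
```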

Collection Process

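The dataset was collected by scraping news articles from the CNN and DailyMail websites. Each article is paired with the highlight bullets written by its authors or editors, which serve as the reference summary. The original corpus was prepared for machine reading comprehension and question answering; later versions, including the 3.0.0 configuration used here, were adapted for abstractive text summarization.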

Preprocessing

- Tokenization with a pretrained tokenizer (e.g., the T5 tokenizer)
- Maximum token length capped at 512 for both input and output sequences
- Lowercasing of all text for consistency
- Special tokens added to mark the start and end of sequences
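
The steps above can be sketched as a single batch function. This is a minimal illustration, not the exact pipeline used for this dataset: the name preprocess is hypothetical, and the tokenizer is assumed to follow the Hugging Face calling convention (a callable that accepts a list of strings plus max_length and truncation, returns input_ids and attention_mask, and adds its own special tokens).

```python
from typing import Callable, Dict, List

MAX_LENGTH = 512  # cap stated in this card for both inputs and targets


def preprocess(
    batch: Dict[str, List[str]],
    tokenizer: Callable,
    max_length: int = MAX_LENGTH,
) -> Dict[str, list]:
    """Lowercase, tokenize, and truncate a batch of article/highlights pairs."""
    # Lowercase both fields before tokenization.
    articles = [text.lower() for text in batch["article"]]
    summaries = [text.lower() for text in batch["highlights"]]

    # Tokenize inputs and targets separately, truncating each to max_length.
    # Special tokens are assumed to be added by the tokenizer itself.
    model_inputs = tokenizer(articles, max_length=max_length, truncation=True)
    targets = tokenizer(summaries, max_length=max_length, truncation=True)

    return {
        "input_ids": model_inputs["input_ids"],
        "attention_mask": model_inputs["attention_mask"],
        "labels": targets["input_ids"],
    }
```

With the Hugging Face libraries installed, one plausible use is to create a tokenizer with AutoTokenizer.from_pretrained("t5-small") and apply the function with dataset.map(lambda b: preprocess(b, tokenizer), batched=True); the checkpoint name is an assumption, since the card only says "e.g., T5 tokenizer".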

How to Use

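from datasets import load_dataset  # requires the Hugging Face `datasets` library
# Load the first 1% of the training split (configuration 3.0.0)
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")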
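
The call above loads only the training slice. If matching slices of the validation and test splits are also wanted, a sketch along the following lines may help. The helper names percent_slice and load_subsets are hypothetical, and taking the same percentage from every split is an assumption rather than something this card specifies.

```python
def percent_slice(split: str, percent: int) -> str:
    """Build a split string such as 'train[:1%]' for the `datasets` slicing syntax."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"{split}[:{percent}%]"


def load_subsets(percent: int = 1) -> dict:
    """Load the first `percent`% of each split (needs network access)."""
    # Imported lazily so percent_slice can be used without the library installed.
    from datasets import load_dataset

    return {
        split: load_dataset("cnn_dailymail", "3.0.0", split=percent_slice(split, percent))
        for split in ("train", "validation", "test")
    }
```

Calling load_subsets() would return a dictionary keyed by split name; each value exposes the article and highlights fields described under Composition.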

Evaluation

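- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): n-gram and longest-common-subsequence overlap with the reference summary
- BLEU (Bilingual Evaluation Understudy): n-gram precision against the reference summary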
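
To make the overlap idea behind ROUGE concrete, here is a simplified ROUGE-1 (unigram overlap) written in plain Python. It is an illustration only: it uses whitespace tokenization and lowercasing, has no stemming, and is not the official scorer, so its numbers will not match published results exactly. The function name rouge1 is illustrative.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(scores)  # 5 of 6 unigrams match in each direction
```

For reported results, an established implementation is preferable to this sketch, since tokenization and stemming choices change the scores.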

Limitations

Data bias: The dataset is composed of news articles from only two major sources, CNN and DailyMail, which may introduce a specific writing style and focus into the summaries.
Domain-specific issues: The dataset is limited to news articles and may not generalize well to other domains such as scientific texts or casual conversations.

Ethical Considerations

Privacy: Since the dataset consists of publicly available news articles, privacy concerns are minimal. However, users should be cautious when generating summaries for sensitive or private information.
Bias: News articles from CNN and DailyMail may reflect specific political or cultural biases, which could influence the summaries generated by models trained on this dataset.