High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

GEM/xsum

XSum is an English news summarization dataset in which the task is to predict the first sentence of an article from the rest of the article. The data comes from BBC articles written in British English and is primarily used for abstractive summarization. Each record has document, summary, and ID fields, and the data is randomly split into training, validation, and test sets. The creators are from the University of Edinburgh, and the license is CC BY-SA 4.0.

hugging_face

JuanKO/T5_summarization_RLAIF

Text Summarization

Reinforcement Learning

---
license: apache-2.0
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: summary_1
    dtype: string
  - name: summary_2
    dtype: string
  splits:
  - name: train
    num_bytes: 1697095
    num_examples: 1000
  download_size: 906302
  dataset_size: 1697095
---

hugging_face

qmsum

Text Summarization

Question Answering Systems

The dataset is used for the QMSum task and contains two features: text content and answer length. It is split into a training set with 1,257 samples and a test set with 200 samples. The test set originates from the LongBench QMSum task, while the training set comes from the original QMSum repository. No built-in validation set is provided, so holding out a portion of the training set for validation is recommended.
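
Because no validation set ships with the data, a holdout has to be carved out of the 1,257 training samples. The following is a minimal sketch of one way to do that using only the Python standard library. The function name `holdout_split`, the 10% fraction, and the seed are illustrative assumptions, not part of the dataset or of any library; the sketch works on sample indices so it stays independent of how the records are actually loaded.

```python
import random


def holdout_split(n_samples, val_fraction=0.1, seed=0):
    """Return (train_indices, val_indices) for a shuffled holdout split.

    Hypothetical helper: the name and defaults are assumptions made
    for illustration only.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_samples * val_fraction))
    # The first n_val shuffled indices become validation, the rest training.
    return sorted(indices[n_val:]), sorted(indices[:n_val])


# 1,257 is the training-set size stated in the listing above.
train_idx, val_idx = holdout_split(1257, val_fraction=0.1, seed=0)
print(len(train_idx), len(val_idx))  # prints "1131 126"
```

Fixing the seed means the same samples land in the validation set on every run, which keeps evaluation numbers comparable across experiments. The 200-sample test set stays untouched.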

hugging_face
