XSum is an English news summarization dataset. The task is to predict the first sentence of an article from the rest of the article. The data comes from BBC articles written in British English and is used primarily for abstractive summarization. Each record has document, summary, and ID fields, and the data is randomly split into training, validation, and test sets. The creators are from the University of Edinburgh, and the license is CC BY-SA 4.0.
---
license: apache-2.0
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: summary_1
    dtype: string
  - name: summary_2
    dtype: string
  splits:
  - name: train
    num_bytes: 1697095
    num_examples: 1000
  download_size: 906302
  dataset_size: 1697095
---
This dataset targets the QMSum task and has two features: text content and answer length. It is split into a training set of 1,257 samples and a test set of 200 samples. The test set comes from the LongBench QMSum task, and the training set comes from the original QMSum repository. No built-in validation set is provided, so holding out a portion of the training set for validation is recommended.
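One way to hold out a validation set from the training split is sketched below. The helper name `split_train_validation`, the 10% fraction, and the seed are illustrative assumptions rather than anything specified by the dataset; integer indices stand in for the 1,257 training samples.

```python
import random


def split_train_validation(examples, val_fraction=0.1, seed=0):
    """Shuffle a copy of `examples` and hold out a fraction for validation.

    Hypothetical helper: the fraction and seed are assumptions, not part
    of the dataset card.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    # First n_val shuffled items become validation, the rest stay in training.
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in indices for the 1,257 training samples described above.
train, val = split_train_validation(range(1257), val_fraction=0.1, seed=0)
print(len(train), len(val))  # 1131 126
```

Fixing the seed keeps the held-out samples the same across runs, so validation scores stay comparable between experiments. The 200-sample test set is left untouched.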