fzkuji/pg19
The PG‑19 dataset is a text‑generation benchmark for language modeling, comprising books published before 1919 extracted from the Project Gutenberg library. It is more than twice the size of the Billion Word benchmark and its average document length is twenty times that of the WikiText benchmark. The dataset is split into training, validation, and test sets and includes metadata such as book titles and publication dates. It is a monolingual dataset containing only English text and is released under the Apache‑2.0 license.
Description
Dataset Overview
Basic Information
- Name: PG-19
- Language: English
- License: Apache-2.0
- Multilinguality: Monolingual
- Source: Original data
- Task Category: Text Generation
- Task ID: Language Modeling
Dataset Size
- Download Size: 11.74 GB
- Dataset Size: 11.51 GB
Dataset Structure
- Features:
- short_book_title: string
- publication_date: integer
- url: string
- text: string
- Splits:
- Training: 28,602 samples
- Validation: 50 samples
- Test: 100 samples
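The structure above can be sketched in Python as a small schema check plus a lazy loader. This is a minimal sketch, not the card's official loading code: the repository id `fzkuji/pg19` is taken from the page title, the split names `train`, `validation`, and `test` are assumed to correspond to the Training, Validation, and Test splits listed here, and the helper names (`validate_record`, `split_fractions`, `stream_pg19`) are hypothetical. Streaming is used so that the roughly 11.74 GB download is not needed just to inspect a few records.

```python
from typing import Any, Dict, Iterator

# Field names and types as listed under "Features" in this card.
EXPECTED_FEATURES = {
    "short_book_title": str,
    "publication_date": int,
    "url": str,
    "text": str,
}

# Split sizes as listed under "Splits" in this card.
# The key names are an assumption about how the splits are named on the Hub.
EXPECTED_SPLITS = {"train": 28_602, "validation": 50, "test": 100}


def validate_record(record: Dict[str, Any]) -> bool:
    """Return True if the record has exactly the expected fields and types."""
    if set(record) != set(EXPECTED_FEATURES):
        return False
    return all(
        isinstance(record[name], expected_type)
        for name, expected_type in EXPECTED_FEATURES.items()
    )


def split_fractions() -> Dict[str, float]:
    """Return each split's share of the total number of books."""
    total = sum(EXPECTED_SPLITS.values())
    return {name: count / total for name, count in EXPECTED_SPLITS.items()}


def stream_pg19(
    split: str = "validation", repo_id: str = "fzkuji/pg19"
) -> Iterator[Dict[str, Any]]:
    """Lazily yield records from one split without downloading the full dataset.

    Requires the Hugging Face `datasets` package; the import is kept inside
    the function so the helpers above work without it installed.
    """
    from datasets import load_dataset

    yield from load_dataset(repo_id, split=split, streaming=True)
```

If the repository id and split names hold, `next(stream_pg19("validation"))` should return one book as a dictionary that `validate_record` accepts. `split_fractions()` shows how small the evaluation splits are relative to training: the 150 validation and test books together make up well under 1% of the 28,752 books.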
Dataset Creation
- Annotation Creator: Expert generated
- Language Creator: Expert generated
Usage Considerations
- Not recommended for training general‑purpose language models (e.g., production dialogue agents), because the texts use a historical language style and carry the biases inherent in older writing.
Additional Information
- Citation:
@article{raecompressive2019,
  author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P},
  title = {Compressive Transformers for Long-Range Sequence Modelling},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/1911.05507},
  year = {2019},
}
- Contributors: @thomwolf, @lewtun, @lucidrains, @lhoestq
Source
Organization: hugging_face
Created: Unknown