fzkuji/pg19
The PG‑19 dataset is a text‑generation benchmark for language modeling, comprising books published before 1919 extracted from the Project Gutenberg library. It is more than twice the size of the Billion Word benchmark and its average document length is twenty times that of the WikiText benchmark. The dataset is split into training, validation, and test sets and includes metadata such as book titles and publication dates. It is a monolingual dataset containing only English text and is released under the Apache‑2.0 license.
Dataset Overview
Basic Information
- Name: PG-19
- Language: English
- License: Apache-2.0
- Multilinguality: Monolingual
- Source: Original data
- Task Category: Text Generation
- Task ID: Language Modeling
Dataset Size
- Download Size: 11.74 GB
- Dataset Size: 11.51 GB
Dataset Structure
- Features:
  - short_book_title: string
  - publication_date: integer
  - url: string
  - text: string
- Splits:
- Training: 28,602 samples
- Validation: 50 samples
- Test: 100 samples
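The schema above can be sketched as a plain Python record, together with a simple helper for cutting a long book into fixed-length training chunks. This is an illustrative sketch only: the `PG19Record` class, the `chunk_text` helper, and the sample values are assumptions for demonstration, not part of the dataset's API.

```python
from dataclasses import dataclass

@dataclass
class PG19Record:
    """One PG-19 example, mirroring the features listed above."""
    short_book_title: str
    publication_date: int  # publication year, e.g. 1908
    url: str
    text: str              # full book text; books are typically very long

def chunk_text(text: str, chunk_len: int) -> list[str]:
    """Split a book into fixed-length character chunks for LM training."""
    return [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]

# Hypothetical record for illustration (not a real dataset entry).
record = PG19Record(
    short_book_title="A Hypothetical Book",
    publication_date=1908,
    url="http://www.gutenberg.org/ebooks/0",
    text="word " * 10,  # 50 characters
)
chunks = chunk_text(record.text, 20)
print(len(chunks))  # 50 chars at 20 per chunk -> 3 chunks
```

In practice the corpus can typically be loaded with the Hugging Face `datasets` library via `load_dataset("fzkuji/pg19")`; given the ~11 GB download size, streaming mode (`streaming=True`) may be preferable for exploration.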
Dataset Creation
- Annotation Creator: Expert generated
- Language Creator: Expert generated
Usage Considerations
- Not recommended for training general‑purpose language models (e.g., production dialogue agents) due to the historic language style and inherent biases of older texts.
Additional Information
- Citation:
  @article{raecompressive2019,
    author  = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P},
    title   = {Compressive Transformers for Long-Range Sequence Modelling},
    journal = {arXiv preprint},
    url     = {https://arxiv.org/abs/1911.05507},
    year    = {2019},
  }
- Contributors: @thomwolf, @lewtun, @lucidrains, @lhoestq