Dataset asset · Open Source Community · Language Modeling · Long‑Range Sequence Modeling

fzkuji/pg19

The PG‑19 dataset is a language‑modeling benchmark comprising books published before 1919, extracted from the Project Gutenberg library. It is more than twice the size of the Billion Word benchmark, and its average document length is roughly twenty times that of the WikiText benchmark. The dataset is split into training, validation, and test sets and includes metadata such as book titles and publication dates. It is a monolingual English dataset released under the Apache‑2.0 license.

Source
Hugging Face
Created
Nov 28, 2025
Updated
Apr 26, 2024
Overview

Dataset description and usage context

Dataset Overview

Basic Information

  • Name: PG-19
  • Language: English
  • License: Apache-2.0
  • Multilinguality: Monolingual
  • Source: Original data
  • Task Category: Text Generation
  • Task ID: Language Modeling

Dataset Size

  • Download Size: 11.74 GB
  • Dataset Size: 11.51 GB

Dataset Structure

  • Features:
    • short_book_title: string
    • publication_date: integer
    • url: string
    • text: string
  • Splits:
    • Training: 28,602 samples
    • Validation: 50 samples
    • Test: 100 samples
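As a sketch of the record layout above, the snippet below models the four listed features and checks the pre‑1919 constraint. The records are illustrative placeholders, not actual dataset entries; loading the real data would use the Hugging Face `datasets` library, as noted in the comment.

```python
from typing import TypedDict

# Schema matching the features listed above. A real load would be:
#   from datasets import load_dataset
#   ds = load_dataset("fzkuji/pg19", split="validation")  # large download
class PG19Record(TypedDict):
    short_book_title: str
    publication_date: int   # year of publication
    url: str                # Project Gutenberg source URL
    text: str               # full book text

# Placeholder records for illustration only (not real PG-19 entries).
records: list[PG19Record] = [
    {"short_book_title": "Example Novel", "publication_date": 1899,
     "url": "https://www.gutenberg.org/ebooks/0", "text": "Chapter I. ..."},
    {"short_book_title": "Example Essays", "publication_date": 1910,
     "url": "https://www.gutenberg.org/ebooks/1", "text": "Preface. ..."},
]

# Every book in PG-19 predates 1919, so this filter should keep everything.
pre_1919 = [r for r in records if r["publication_date"] < 1919]
print(len(pre_1919))  # 2
```

The same field names apply to each split (training, validation, test); only the number of samples differs.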

Dataset Creation

  • Annotation Creator: Expert generated
  • Language Creator: Expert generated

Usage Considerations

  • Not recommended for training general‑purpose language models (e.g., production dialogue agents) due to the historic language style and inherent biases of older texts.

Additional Information

  • Citation:

    @article{raecompressive2019,
      author  = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P},
      title   = {Compressive Transformers for Long-Range Sequence Modelling},
      journal = {arXiv preprint},
      url     = {https://arxiv.org/abs/1911.05507},
      year    = {2019},
    }

  • Contributors: @thomwolf, @lewtun, @lucidrains, @lhoestq
