
fzkuji/pg19

PG-19 is a language-modeling benchmark built from Project Gutenberg books published before 1919. It is more than twice the size of the Billion Word benchmark, and its documents are on average twenty times longer than those in the WikiText long-range language modeling benchmark. The data is split into training, validation, and test sets, and each book carries metadata: a short title, a publication date, and a source URL. The text is English only, and the dataset is released under the Apache-2.0 license.

Updated 4/26/2024
hugging_face

Description

Dataset Overview

Basic Information

  • Name: PG-19
  • Language: English
  • License: Apache-2.0
  • Multilinguality: Monolingual
  • Source: Original data
  • Task Category: Text Generation
  • Task ID: Language Modeling

Dataset Size

  • Download Size: 11.74 GB
  • Dataset Size: 11.51 GB

Dataset Structure

  • Features:
    • short_book_title: string
    • publication_date: integer
    • url: string
    • text: string
  • Splits:
    • Training: 28,602 samples
    • Validation: 50 samples
    • Test: 100 samples
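
The structure above can be sketched in Python. This is a minimal illustration, not an official loader: the `PG19Record` type, the `SPLIT_SIZES` table, and both helper functions are hypothetical names introduced here. The split names `train`, `validation`, and `test` are assumed to correspond to the Training, Validation, and Test splits listed on this card, and `stream_books` assumes the Hugging Face `datasets` package is installed and that the repository id `fzkuji/pg19` resolves on the Hub.

```python
from typing import Iterator, TypedDict


class PG19Record(TypedDict):
    """One book, with the four features listed on this card."""

    short_book_title: str
    publication_date: int
    url: str
    text: str


# Split sizes as listed on this card (split names are an assumption).
SPLIT_SIZES = {"train": 28_602, "validation": 50, "test": 100}


def split_fractions(sizes: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of books."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def stream_books(split: str = "validation") -> Iterator[PG19Record]:
    """Lazily yield books instead of downloading the full ~11.7 GB first.

    Assumes `datasets` is installed and that "fzkuji/pg19" exposes the
    requested split; neither is verified by this sketch.
    """
    from datasets import load_dataset  # optional dependency, imported lazily

    yield from load_dataset("fzkuji/pg19", split=split, streaming=True)
```

Calling `split_fractions(SPLIT_SIZES)` shows that just over 99% of the books sit in the training split, so the validation and test sets are small by comparison. Streaming is used in `stream_books` because each record is a full book and the download is large; iterating over `stream_books("validation")` would fetch books one at a time.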

Dataset Creation

  • Annotation Creator: Expert generated
  • Language Creator: Expert generated

Usage Considerations

  • Not recommended for training general-purpose language models (e.g., production dialogue agents): the books use a historical style of English and carry the biases of their era.

Additional Information

  • Citation:

    @article{raecompressive2019,
      author  = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P},
      title   = {Compressive Transformers for Long-Range Sequence Modelling},
      journal = {arXiv preprint},
      url     = {https://arxiv.org/abs/1911.05507},
      year    = {2019},
    }

  • Contributors: @thomwolf, @lewtun, @lucidrains, @lhoestq

Topics

Language Modeling
Long‑Range Sequence Modeling

Source

Organization: hugging_face

Created: Unknown
