fzkuji/pg19
The PG‑19 dataset is a text‑generation benchmark for language modeling, comprising books published before 1919 extracted from the Project Gutenberg library. It is more than twice the size of the Billion Word benchmark and its average document length is twenty times that of the WikiText benchmark. The dataset is split into training, validation, and test sets and includes metadata such as book titles and publication dates. It is a monolingual dataset containing only English text and is released under the Apache‑2.0 license.
Dataset Overview
Basic Information
- Name: PG-19
- Language: English
- License: Apache-2.0
- Multilinguality: Monolingual
- Source: Original data
- Task Category: Text Generation
- Task ID: Language Modeling
Dataset Size
- Download Size: 11.74 GB
- Dataset Size: 11.51 GB
Dataset Structure
- Features:
  - short_book_title: string
  - publication_date: integer
  - url: string
  - text: string
- Splits:
- Training: 28,602 samples
- Validation: 50 samples
- Test: 100 samples
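The schema above can be sketched as a plain Python record, together with a simple helper for cutting a long book into fixed-length training chunks. This is an illustrative sketch only: the `PG19Record` class, the `chunk_text` helper, and the sample values are assumptions for demonstration, not part of the dataset's API.

```python
from dataclasses import dataclass

@dataclass
class PG19Record:
    """One PG-19 example, mirroring the features listed above."""
    short_book_title: str
    publication_date: int  # publication year, e.g. 1908
    url: str
    text: str              # full book text; books are typically very long

def chunk_text(text: str, chunk_len: int) -> list[str]:
    """Split a book into fixed-length character chunks for LM training."""
    return [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]

# Hypothetical record for illustration (not a real dataset entry).
record = PG19Record(
    short_book_title="A Hypothetical Book",
    publication_date=1908,
    url="http://www.gutenberg.org/ebooks/0",
    text="word " * 10,  # 50 characters
)
chunks = chunk_text(record.text, 20)
print(len(chunks))  # 50 chars at 20 per chunk -> 3 chunks
```

In practice the corpus can typically be loaded with the Hugging Face `datasets` library via `load_dataset("fzkuji/pg19")`; given the ~11 GB download size, streaming mode (`streaming=True`) may be preferable for exploration.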
Dataset Creation
- Annotation Creator: Expert generated
- Language Creator: Expert generated
Usage Considerations
- Not recommended for training general‑purpose language models (e.g., production dialogue agents) due to the historic language style and inherent biases of older texts.
Additional Information
- Citation:
  @article{raecompressive2019,
    author  = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P},
    title   = {Compressive Transformers for Long-Range Sequence Modelling},
    journal = {arXiv preprint},
    url     = {https://arxiv.org/abs/1911.05507},
    year    = {2019},
  }
- Contributors: @thomwolf, @lewtun, @lucidrains, @lhoestq