High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

fzkuji/pg19

PG-19 is a language-modeling benchmark built from Project Gutenberg books published before 1919. It is more than twice the size of the Billion Word benchmark, and its documents are on average twenty times longer than those in the WikiText benchmark. The dataset is split into training, validation, and test sets and includes metadata such as book titles and publication dates. The text is English only, and the dataset is released under the Apache-2.0 license.
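As a rough illustration of how the splits and metadata might be inspected, the sketch below streams the validation split with the Hugging Face `datasets` library and summarizes a few books. The field names `short_book_title`, `publication_date`, and `text` are assumptions based on the original PG-19 release and may differ in this copy; the helper names `summarize_books` and `stream_validation_split` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Mapping


def summarize_books(records: Iterable[Mapping[str, Any]], limit: int = 5) -> List[Dict[str, Any]]:
    """Return title, publication date, and word count for the first `limit` records."""
    summary = []
    for record in islice(records, limit):
        summary.append({
            # Assumed field names; a missing field yields None rather than an error.
            "title": record.get("short_book_title"),
            "published": record.get("publication_date"),
            # Whitespace-separated word count of the book text.
            "words": len(record.get("text", "").split()),
        })
    return summary


def stream_validation_split():
    """Stream the validation split so the full corpus is not downloaded up front."""
    # Third-party dependency (pip install datasets); imported lazily so that
    # summarize_books can be used without it.
    from datasets import load_dataset

    return load_dataset("fzkuji/pg19", split="validation", streaming=True)
```

Calling `summarize_books(stream_validation_split(), limit=3)` would then return one small dictionary per book, provided the repository exposes the assumed fields.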

Source: Hugging Face