Tags
Open Source Community, Text Generation, Poetry
Gutenberg Poetry Corpus
A poetry corpus extracted from Project Gutenberg, containing approximately three million lines of poetry. It is particularly suited to creative work in computational poetry and text generation.
Source
GitHub
Created
Aug 13, 2018
Updated
May 7, 2024
Overview
Dataset description and usage context
A Gutenberg Poetry Corpus Overview
Dataset Description
- Name: Gutenberg Poetry Corpus
- Creator: Allison Parrish
- Content: Approximately three million lines of poetry extracted from hundreds of books in Project Gutenberg.
- Format: Provided as gzip-compressed newline-delimited JSON.
- Structure: Each line of poetry is represented by a JSON object containing an "s" key (the text of the line) and a "gid" key (the ID of the source book).
- Usage: Particularly suited to creative computational poetry and text generation.
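To make the record structure concrete, here is a minimal sketch of parsing one record in Python. The sample text and book ID are invented for illustration, and storing the "gid" value as a string is an assumption rather than something stated above.

```python
import json

# Hypothetical sample record: the text and ID are made up for illustration,
# and the string type of "gid" is an assumption.
sample = '{"s": "The sun is warm, the sky is clear,", "gid": "4800"}'


def parse_record(raw):
    """Return (text, book_id) from one newline-delimited JSON record."""
    record = json.loads(raw)
    return record["s"], record["gid"]


text, book_id = parse_record(sample)
print(book_id, text)
```

Each record is independent, so a file can be processed one line at a time without loading the whole corpus into memory.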
Usage
- Data Access: Obtain the dataset via the download link.
- Data Processing: Process the JSON format directly using programming languages such as Python.
- Example: The repository provides a Quick Experiments notebook showing how to get started with the dataset in Python.
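The processing step can be sketched with the Python standard library alone. This is a rough outline rather than the notebook's code: the function names are hypothetical, and the path passed in should point to wherever the downloaded gzip file was saved.

```python
import gzip
import json
from collections import Counter


def iter_records(path):
    """Yield one dict per non-empty line of a gzip-compressed NDJSON file."""
    # "rt" decompresses and decodes to text in one step.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if raw:
                yield json.loads(raw)


def lines_per_book(path):
    """Count poetry lines per source book ID (the "gid" key)."""
    return Counter(record["gid"] for record in iter_records(path))
```

Calling `lines_per_book` on the downloaded file returns a `Counter`, so `.most_common(10)` would list the ten books that contribute the most lines.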
Dataset Construction
- Generation Process: The build.py script reads Project Gutenberg books through Gutenberg, dammit, selects books with the "poetry" subject, and extracts lines of poetry based on textual features.
- Filtering Mechanism: wordfilter is used to exclude lines that may contain offensive content.
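The filtering idea can be illustrated with a toy blocklist check. This is not wordfilter's API or word list: the blocklist entries are placeholders, and the case-insensitive substring match is an assumption about the general approach, not a reproduction of build.py.

```python
# Placeholder terms for illustration only; not the wordfilter word list.
PLACEHOLDER_BLOCKLIST = ["badword", "slur"]


def is_blocked(line, blocklist=PLACEHOLDER_BLOCKLIST):
    """Return True if any blocklisted term occurs in the line, ignoring case."""
    lowered = line.lower()
    return any(term in lowered for term in blocklist)


def filter_lines(lines, blocklist=PLACEHOLDER_BLOCKLIST):
    """Keep only the lines that contain no blocklisted term."""
    return [line for line in lines if not is_blocked(line, blocklist)]
```

Substring matching is deliberately aggressive: it also drops harmless lines that happen to contain a blocked term inside a longer word, trading some recall for a lower chance of offensive output.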
Notes
- Content Review: The poetry lines have not been individually reviewed; users must check that the content is suitable for their own purposes.
- Copyright Status: The dataset includes only poetry lines from English books that are in the public domain in the United States.
License
- Data: Released under the CC0 Public Domain Dedication.
- Code: Released under the MIT License.