Gutenberg Poetry Corpus
A corpus of approximately three million lines of poetry extracted from Project Gutenberg, particularly suited to creative computational poetry and text generation applications.
Updated 5/7/2024
Description
A Gutenberg Poetry Corpus Overview
Dataset Description
- Name: Gutenberg Poetry Corpus
- Creator: Allison Parrish
- Content: Approximately three million lines of poetry extracted from hundreds of books in Project Gutenberg.
- Format: Provided as gzip-compressed newline-delimited JSON.
- Structure: Each line of poetry is represented by a JSON object containing an `s` key (poetry text) and a `gid` key (source book ID).
- Usage: Particularly suitable for creative computational poetry text generation.
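As a minimal sketch of the record structure described above, the snippet below parses one JSON object and reads its two keys. The sample text and ID are made up for illustration, and treating `gid` as a string is an assumption; the code converts it with `str()` so it works either way.

```python
import json

# Illustrative record only: the text and ID values are invented for this
# example and are not taken from the corpus.
sample = '{"s": "An example line of verse", "gid": "123"}'

record = json.loads(sample)

text = record["s"]            # the line of poetry
book_id = str(record["gid"])  # source book ID (type assumed; normalized to str)

print(f"{book_id}: {text}")
```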
Usage
- Data Access: Obtain the dataset via the download link.
- Data Processing: Process the JSON format directly using programming languages such as Python.
- Example: Provides a Quick Experiments notebook demonstrating how to quickly use the dataset in Python.
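The processing step above might look like the following sketch, which streams a gzip-compressed newline-delimited JSON file with the Python standard library and counts lines per source book. The helper names `iter_records` and `lines_per_book` are hypothetical, and the file path is whatever the download produced; this is not the code from the Quick Experiments notebook.

```python
import gzip
import json
from collections import Counter
from typing import Dict, Iterator


def iter_records(path: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per non-empty line of a gzip-compressed NDJSON file."""
    # "rt" opens the compressed stream in text mode, so each iteration
    # yields one decoded line without loading the whole file into memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if raw:
                yield json.loads(raw)


def lines_per_book(path: str) -> Counter:
    """Count poetry lines per source book ID (the `gid` key)."""
    return Counter(str(rec["gid"]) for rec in iter_records(path))
```

Calling `lines_per_book(path).most_common(5)` on the downloaded file would list the five books contributing the most lines. Streaming line by line keeps memory use flat, which matters for a corpus of about three million records.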
Dataset Construction
- Generation Process: Uses the `build.py` script to access Project Gutenberg books via Gutenberg, dammit, selecting books with the `poetry` subject and extracting poetry lines based on textual features.
- Filtering Mechanism: Employs wordfilter to exclude lines that may contain offensive content.
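The two selection steps above (subject selection, then offensive-content filtering) can be illustrated with a standard-library stand-in. This is not the `build.py` script and does not use the wordfilter package: the blocklist, the function names, and the substring match on subject headings are all assumptions made for the sketch, and the textual-feature heuristics for spotting poetry lines are left out because the description does not specify them.

```python
from typing import Iterable, Iterator

# Hypothetical placeholder blocklist; the wordfilter package ships its own list.
BLOCKLIST = frozenset({"badword"})


def has_poetry_subject(subjects: Iterable[str]) -> bool:
    """True if any subject heading mentions poetry (case-insensitive; assumed rule)."""
    return any("poetry" in subject.lower() for subject in subjects)


def is_blocked(line: str, blocklist: Iterable[str] = BLOCKLIST) -> bool:
    """True if the line contains any blocklisted term as a substring."""
    lowered = line.lower()
    return any(term in lowered for term in blocklist)


def keep_lines(
    lines: Iterable[str],
    subjects: Iterable[str],
    blocklist: Iterable[str] = BLOCKLIST,
) -> Iterator[str]:
    """Yield non-empty, non-blocked lines from a book with a poetry subject."""
    if not has_poetry_subject(subjects):
        return  # book is skipped entirely
    for line in lines:
        if line.strip() and not is_blocked(line, blocklist):
            yield line
```

For example, `list(keep_lines(["Roses are red", "a badword here"], ["English poetry"]))` keeps only the first line, and any book without a poetry subject yields nothing.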
Notes
- Content Review: The dataset has not undergone individual review of each poetry line; users must ensure suitability themselves.
- Copyright Status: The dataset includes only poetry lines from English books that are in the public domain in the United States.
License
- Data: Released under the CC0 Public Domain Dedication.
- Code: Released under the MIT License.
Topics
Poetry
Text Generation
Source
Organization: github
Created: 8/13/2018