Gutenberg Poetry Corpus

This corpus was extracted from Project Gutenberg and contains approximately three million lines of poetry. It is particularly well suited to creative applications such as computational poetry and text generation.

Updated 5/7/2024

Description

Gutenberg Poetry Corpus Overview

Dataset Description

  • Name: Gutenberg Poetry Corpus
  • Creator: Allison Parrish
  • Content: Approximately three million lines of poetry extracted from hundreds of books in Project Gutenberg.
  • Format: Provided as gzip-compressed newline-delimited JSON.
  • Structure: Each line of poetry is represented by one JSON object with two keys: s (the text of the line) and gid (the ID of the source book).
  • Intended use: Creative applications such as computational poetry and text generation.
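The format described above can be read with the Python standard library alone. The following is a minimal sketch, not code from the project: the function name iter_lines and the file path passed to it are assumptions made for illustration, and only the s and gid keys come from the description above.

```python
import gzip
import json
from typing import Iterator


def iter_lines(path: str) -> Iterator[dict]:
    """Yield one dict per poetry line from a gzip-compressed NDJSON file.

    `path` is whatever local filename the downloaded corpus was saved under
    (an assumption; the listing does not name the file).
    """
    # Mode "rt" decompresses and decodes to text in one step.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if raw:  # skip blank lines defensively
                # Each non-empty line is a standalone JSON object
                # with an "s" key and a "gid" key.
                yield json.loads(raw)
```

Because the function is a generator, the file is streamed line by line rather than loaded whole. A caller might loop over iter_lines(path) and read record["s"] and record["gid"] from each record.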

Usage

  • Data Access: Obtain the dataset via the download link.
  • Data Processing: Each line of the decompressed file is a standalone JSON object, so the data can be parsed directly in programming languages such as Python.
  • Example: The project provides a Quick Experiments notebook demonstrating how to get started with the dataset in Python.
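As an illustration of the kind of quick experiment the corpus invites, the sketch below draws a few random lines to form a cut-up stanza. It is not taken from the Quick Experiments notebook; the function name random_stanza and its parameters are hypothetical, and the records are assumed to be dicts with an s key as described above.

```python
import random
from typing import Iterable, List, Optional


def random_stanza(records: Iterable[dict], size: int = 4,
                  seed: Optional[int] = None) -> List[str]:
    """Return up to `size` distinct randomly chosen poetry lines."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    # Note: this holds every line's text in memory at once.
    texts = [record["s"] for record in records]
    # Cap the sample size so short inputs do not raise ValueError.
    return rng.sample(texts, min(size, len(texts)))
```

Collecting all texts first keeps the sketch short. For the full three-million-line corpus, a streaming approach such as reservoir sampling would use less memory.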

Dataset Construction

  • Generation Process: The build.py script reads Project Gutenberg books through the Gutenberg, dammit corpus, selects books whose subject is poetry, and extracts the lines that look like poetry based on their textual features.
  • Filtering Mechanism: Employs wordfilter to exclude lines that may contain offensive content.
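The filtering step can be pictured as a blocklist check applied to each line before it is kept. The sketch below is a stand-in written for illustration: it does not call the wordfilter library and does not reproduce build.py. The names BLOCKLIST, is_blocked, and keep_clean, and the placeholder terms, are all hypothetical.

```python
from typing import Iterable, Iterator, Sequence

# Hypothetical placeholder terms; a real blocklist would be far longer.
BLOCKLIST = ("badword", "worseword")


def is_blocked(text: str, blocklist: Sequence[str] = BLOCKLIST) -> bool:
    """Return True if any blocklisted term occurs in `text` (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in blocklist)


def keep_clean(records: Iterable[dict],
               blocklist: Sequence[str] = BLOCKLIST) -> Iterator[dict]:
    """Yield only the records whose "s" text passes the blocklist check."""
    for record in records:
        if not is_blocked(record["s"], blocklist):
            yield record
```

Substring matching of this kind is deliberately blunt: it can drop harmless lines that happen to contain a listed term, and it cannot catch offensive content that uses no listed term.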

Notes

  • Content Review: The poetry lines have not been reviewed individually, so users must judge for themselves whether the content suits their application.
  • Copyright Status: The dataset includes only poetry lines from English books that are in the public domain in the United States.

License

  • Data: Released under the CC0 Public Domain Dedication.
  • Code: Released under the MIT License.

Topics

Poetry
Text Generation

Source

Organization: github

Created: 8/13/2018
