
Gutenberg Poetry Corpus

A corpus of approximately three million lines of poetry extracted from Project Gutenberg, well suited to creative work in computational poetry and text generation.

Source: GitHub
Created: Aug 13, 2018
Updated: May 7, 2024
Overview

Dataset Description

  • Name: Gutenberg Poetry Corpus
  • Creator: Allison Parrish
  • Content: Approximately three million lines of poetry extracted from hundreds of books in Project Gutenberg.
  • Format: Provided as gzip-compressed newline-delimited JSON.
  • Structure: Each line of poetry is a JSON object with an s key (the text of the line) and a gid key (the Project Gutenberg ID of the source book).
  • Usage: Well suited to creative work in computational poetry and text generation.
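
The format described above (gzip-compressed, newline-delimited JSON with s and gid keys) can be read with the Python standard library alone. The following is a minimal sketch, not code from the dataset's repository; the function name read_corpus is hypothetical, and the file path is whatever name the downloaded file has on your machine.

```python
import gzip
import json


def read_corpus(path):
    """Yield (text, gutenberg_id) pairs from a gzip-compressed NDJSON file.

    Assumes each non-blank line is a JSON object with an "s" key (the line
    of poetry) and a "gid" key (the source book ID), as described above.
    """
    # "rt" opens the gzip stream in text mode so it can be read line by line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if not raw:
                continue  # tolerate blank lines
            record = json.loads(raw)
            yield record["s"], record["gid"]
```

Because read_corpus is a generator, the file is streamed rather than loaded whole, which matters for a corpus of roughly three million lines. Wrap the call in list(...) only if the full corpus fits comfortably in memory.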

Usage

  • Data Access: Obtain the dataset via the download link.
  • Data Processing: The newline-delimited JSON can be parsed directly, one line at a time, in languages such as Python.
  • Example: The source repository provides a Quick Experiments notebook that demonstrates how to start using the dataset in Python.
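
As an illustration of the kind of quick experiment the dataset lends itself to, the sketch below picks lines that mention a given word and samples a few of them into a short found poem. It is not taken from the Quick Experiments notebook; lines_containing and sample_poem are hypothetical helper names, and the input is assumed to be a plain list of line strings already loaded from the corpus.

```python
import random
import re


def lines_containing(lines, word):
    """Return the lines that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]


def sample_poem(lines, n, seed=None):
    """Pick up to `n` distinct lines at random; `seed` makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(lines, min(n, len(lines)))
```

For example, sample_poem(lines_containing(all_lines, "sea"), 4) would return up to four lines mentioning the sea, where all_lines is a list you have built from the corpus.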

Dataset Construction

  • Generation Process: The build.py script reads Project Gutenberg books through the "Gutenberg, dammit" corpus, selects books whose subject includes poetry, and extracts the lines that appear to be poetry based on textual features.
  • Filtering Mechanism: The wordfilter library is used to exclude lines that may contain offensive content.
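
The two construction steps above (feature-based line extraction, then a word filter) can be sketched roughly as follows. This is an illustration only: the page does not list the features build.py actually checks, so the rules in looks_like_poetry are assumptions chosen for the example, and BLOCKLIST is a hypothetical stand-in for the wordfilter list rather than a use of that library.

```python
import re

# Hypothetical placeholder; the real wordfilter list is not reproduced here.
BLOCKLIST = {"badword"}


def looks_like_poetry(line, max_len=64):
    """Guess whether a line is verse, using illustrative (assumed) features."""
    stripped = line.strip()
    if not stripped or len(stripped) > max_len:
        return False  # empty, or long enough to be wrapped prose
    if not stripped[0].isupper():
        return False  # assume verse lines start with a capital letter
    if stripped.isupper():
        return False  # all caps is more likely a heading
    if re.search(r"\d", stripped):
        return False  # digits suggest page numbers or notes
    return True


def is_clean(line, blocklist=BLOCKLIST):
    """Return True if no blocklisted term occurs in the line (substring match)."""
    lowered = line.lower()
    return not any(term in lowered for term in blocklist)


def extract_lines(text):
    """Keep the lines of `text` that pass both the feature check and the filter."""
    return [
        line.strip()
        for line in text.splitlines()
        if looks_like_poetry(line) and is_clean(line)
    ]
```

Heuristics like these trade recall for simplicity: they will drop some genuine verse (for instance, lines that open with a quotation mark) and keep some non-verse, which is consistent with the note that the lines have not been individually reviewed.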

Notes

  • Content Review: The lines of poetry have not been individually reviewed; users must check that the content is suitable for their own purposes.
  • Copyright Status: The dataset includes only poetry lines from English books that are in the public domain in the United States.

License

  • Data: Released under the CC0 Public Domain Dedication.
  • Code: Released under the MIT License.